US20200311844A1 - Identifying duplicate user accounts in an identification document processing system - Google Patents
- Publication number
- US20200311844A1 (application US16/832,726)
- Authority
- US
- United States
- Prior art keywords
- user accounts
- similarity score
- user
- similarity
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G06Q50/265—Personal security, identity or safety
- G06F18/2178—Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor
- G06F18/22—Matching criteria, e.g. proximity measures
- G06K9/00288
- G06K9/00456
- G06K9/6215
- G06K9/6263
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/09—Supervised learning
- G06T3/02—Affine transformations
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V30/19173—Classification techniques
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06V40/172—Classification, e.g. identification
- H04L67/306—User profiles
- G06N20/00—Machine learning
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06V10/247—Aligning, centring, orientation detection or correction of the image by affine transforms, e.g. correction due to perspective effects; Quadrilaterals, e.g. trapezoids
Definitions
- The present invention generally relates to the field of identifying fraudulent user accounts in a system, and more specifically, to detection of duplicate user accounts using deep learning.
- User accounts for various applications are primarily based on information input by associated users.
- The system may request the user to enter personal information, such as name, address, or phone number, according to an embodiment.
- The user may be required to upload an image of their identification document, which the system uses to verify their identity. The system can then use the identification document to ensure that no duplicate user accounts are created for a single user.
- Systems and methods are disclosed herein for identifying duplicate user accounts. These systems and methods may be applied using user identifying information, such as a name, an address, or an image of a user, which may be preprocessed to fix any errors with the image (e.g., distortion or incorrect orientation).
- The system creates a graph representing the user accounts as nodes connected by edges, where each edge represents a similarity score between a pair of nodes, and iteratively removes edges whose similarity scores fall below a threshold until only connected components of nodes for similar user accounts remain.
- The system receives a plurality of user accounts and determines similarity scores indicating the similarity between each pair of user accounts in the plurality of user accounts.
- The system determines an initial threshold similarity score indicative of a particular degree of similarity between user accounts.
- The system repeats a set of steps for a plurality of iterations. Each iteration has a threshold similarity score, which is initialized to the initial threshold similarity score.
- The steps include: (1) determining one or more connected components of a graph of nodes and edges, where each node represents a user account and a pair of nodes has an edge if the similarity score of the pair indicates a greater degree of similarity than indicated by the threshold similarity score; and (2) modifying the threshold similarity score for the next iteration to a value indicative of a higher degree of similarity between user accounts than the threshold similarity score for the current iteration.
- The system repeats the set of steps for each of the plurality of iterations. For example, the system may repeat the process for a fixed number of iterations. Alternatively, the system may repeat the process until two subsequent iterations result in the same set of connected components.
- The system may also repeat the process until an aggregate measure of the sizes of the connected components indicates that the connected components are small, having an aggregate size below a threshold value. Responsive to repeating the set of steps for the plurality of iterations, the system identifies one or more connected components, each representing a set of user accounts for a particular user. The system transmits information describing the identified connected components to a client device, for example, to a privileged user. In some embodiments, the system may remove one or more edges with a similarity score indicative of a degree of similarity less than the modified threshold similarity score.
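- The iterative procedure above can be sketched in Python. This is a minimal illustration rather than the patented implementation: the similarity function, the threshold step size, and the stopping rule are assumptions chosen for demonstration.

```python
from itertools import combinations

def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return {frozenset(g) for g in groups.values()}

def find_duplicate_groups(accounts, similarity, initial_threshold,
                          step=0.05, max_iterations=10):
    """Keep only edges whose similarity exceeds the threshold, compute
    connected components, then raise the threshold; stop when two
    subsequent iterations yield the same components (one of the
    stopping conditions described above)."""
    threshold = initial_threshold
    previous, components = None, set()
    for _ in range(max_iterations):
        edges = [(a, b) for a, b in combinations(accounts, 2)
                 if similarity(a, b) > threshold]
        components = connected_components(accounts, edges)
        if components == previous:
            break
        previous = components
        threshold += step  # demand a higher degree of similarity next pass
    return [g for g in components if len(g) > 1]  # multi-account groups
```

- A component with more than one node represents a set of accounts suspected to belong to the same user.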
- FIG. 1 illustrates a computing environment for processing documents, according to one embodiment.
- FIG. 2 is a high-level block diagram illustrating a detailed view of the image orientation module, according to one embodiment.
- FIG. 3 is a high-level block diagram illustrating a detailed view of the duplicate user account detection module, according to one embodiment.
- FIG. 4 is a flowchart illustrating the process for image transformation from an identification document, according to one embodiment.
- FIG. 5 illustrates photo orientation corrections, according to one embodiment.
- FIG. 6A illustrates a generic photo orientation correction, according to one embodiment.
- FIG. 6B illustrates transforming the identification document to change the shape of the identification document, according to one embodiment.
- FIG. 7 illustrates example images of identification documents, according to one embodiment.
- FIG. 8 is a flowchart illustrating the process for identifying duplicate user accounts, according to one embodiment.
- FIG. 9A illustrates examples of connected components of user accounts, according to one embodiment.
- FIG. 9B illustrates an example of forming connected components after modifying the threshold value for the threshold similarity scores.
- FIG. 10A illustrates an example of a set of images from identification documents at one threshold value, according to one embodiment.
- FIG. 10B illustrates an example of subsets of the set of images from FIG. 10A, the subsets having a greater degree of similarity than the set of images from FIG. 10A, according to one embodiment.
- FIG. 11 is a high-level block diagram illustrating physical components of a computer used as part or all of the client device from FIG. 1, according to one embodiment.
- This application details a technique for determining duplicate user accounts within a system.
- User accounts may include user identifying information, such as name, address, or an image of an identification document. Examples of identification documents include driver's licenses, passports, or any other government-issued identification document.
- The system employs a preprocessing technique to determine similarities between document images from separate user accounts. The system determines degrees of similarity between user accounts and may then employ a clustering algorithm to reduce the rate of false positives in similarity matches, improving the precision of the system in identifying duplicate user accounts.
- FIG. 1 illustrates a computing environment for processing documents, according to one embodiment.
- The computing environment, or system, includes a client device 110, network 120, and server 100, according to an embodiment. These components are described in additional detail below.
- The client device 110 is a computing device such as a smart phone, laptop computer, desktop computer, or any other device that can access the network 120 and, subsequently, the server 100.
- The network 120 may be any suitable communications network for data transmission.
- The network 120 uses standard communications technologies and/or protocols and can include the Internet.
- Alternatively, the entities use custom and/or dedicated data communications technologies.
- The network 120 connects the client device 110 to the server 100.
- The server 100 comprises an image orientation module 130, a user account store 160, and a duplicate user account detection module 140.
- The server 100 processes identification documents (hereafter "documents," for simplicity).
- The image orientation module 130 detects the orientation of a document in a given image, for example, the position at which the document is placed and the orientation and distortion of the document in the image.
- The duplicate user account detection module 140 detects whether two user accounts are duplicates, i.e., user accounts of the same user.
- The user account store 160 stores user accounts within the system.
- A user account is associated with a user of the system and contains identifying information such as name, address, phone number, and images of identification documents, according to an embodiment. In other embodiments, user accounts may include more or less information about users associated with the accounts.
- FIG. 1 shows one possible configuration of the system.
- There may be more or fewer systems or components; for example, there may be multiple servers 100, or the server 100 may be composed of multiple systems such as individual servers or load balancers.
- FIG. 2 is a high-level block diagram illustrating a detailed view of the image orientation module, according to one embodiment.
- The image orientation module 130 receives an image of a document and determines the orientation and distortion of the document in the given image, for example, the orientation of an identification document in an image submitted to the server 100 for evaluation.
- The orientation of the document may include the angle of rotation, the direction of rotation, and the position of the document in the image.
- The various components of the image orientation module 130 include, but are not limited to, a document store 210, an image transformation module 250, an image detection module 230, a text detection module 240, a neural network 260, and a training module 270, according to one embodiment. In other embodiments, there may be other components not shown in FIG. 2.
- The document store 210 stores images of documents, for example, electronic copies of physical identification documents associated with a user.
- Identification documents include passports, driver's licenses, national or state IDs, and so on.
- Identification documents may be used to verify the identity of a user.
- Some examples of identification documents include driver's licenses and passports.
- A user uploads an image of the user's identification document to the system.
- An identification document includes an image and text.
- The text may include information describing the user, for example, the user's name, the user's address, the user's date of birth, the date the document was issued, the date of expiry of the document, and an identification number.
- The image in the document is typically an image of the user, for example, a photo of the user.
- The image transformation module 250 processes the image of an identification document associated with a user.
- The image transformation module 250 detects the identification document within the image and transforms the image to reorient and/or scale the identification document within the image, according to an embodiment.
- An identification document may be distorted or rotated relative to the orientation of the image itself. These orientations are further described in FIG. 6.
- The image detection module 230 receives an input image comprising a document for processing and detects one or more images within the bounds of the document.
- An image detected within the document is of a user identified by the identification document, for example, an image of the face of the user.
- The image detected within the document may be stored in relation to the user and user account in the user account store 160.
- The image detection module 230 detects other images within the document and stores them in the user account store 160. Examples of possible images that are detected include images of the user's signature, the background of the identification document, and the shape of a geographical region associated with the identification document, such as a state or province.
- The image detection module 230 also records the location of an image in a document, the size of the image, the orientation of the image, the relative positions of two images, the relative position of an image and a text snippet, and other parameters describing the image or images within the document. In some embodiments, the image detection module 230 uses these features of the image to determine the location of the document in the image, so that the document may be transformed. For example, certain types of identification documents have the user's image at a particular location within the document, having a particular size with respect to the size of the document and a particular orientation with respect to the identification document or with respect to one or more text fields or text snippets present in the document. In an embodiment, the image detection module 230 uses these features to detect the type of identification document and the correct orientation of the identification document. Accordingly, the duplicate user account detection module 140 extracts the features describing the images and provides them as input to a machine learning model.
- The machine learning model is trained using a training data set comprising images of various identification documents.
- The machine learning model is configured to determine scores indicative of the parameters describing the document within an input image, for example, scores indicative of the position of the document within the image, the orientation of the document, and so on.
- The neural network 260 may automatically infer these parameters and use them for detecting the type of the identification document, the orientation of the identification document, and other information describing the document.
- The machine learning model determines a boundary of the input image and uses the boundaries to determine the parameters.
- The text detection module 240 detects text within the transformed images of identification documents. Text may include, according to some embodiments, a user's name, address, identification number, or other identifying information. In other embodiments, the text detection module 240 detects text boundaries instead of or in conjunction with text. The detected text may be stored in the user account store 160 in relation to the associated user and user account. In some embodiments, the text from the transformed images is compared to information contained within the user account to verify the validity of the information. The text detection module 240 also records the location of text in a document, which may be used to determine the location and the orientation of the document within the image.
- The text detection module 240 performs optical character recognition (OCR) to recognize certain snippets of the text, for example, "Name", "Date of birth", "Address", and so on.
- The image detection module 230 generates features based on the positions of these text snippets within the document, for example, relative to a particular corner of the document.
- The image detection module 230 generates features based on the relative positions of various text snippets, for example, the position of the "Address" snippet with respect to the "Name" snippet, and so on.
- The image detection module 230 generates features based on the relative positions of images within the document compared to various text snippets, for example, the position of the image of the user with respect to the "Name" snippet, or the position of a logo in the identification document with respect to the "Address" snippet, and so on.
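- As an illustration of such position-based features, the sketch below computes the offset of each detected element relative to the "Name" snippet. The labels and coordinates are hypothetical; a real system would derive them from the OCR and image detection steps described above.

```python
def relative_position_features(boxes):
    """boxes maps a detected element label (e.g. "Name", "Address",
    "photo") to the (x, y) top-left corner of its bounding box.
    Returns each element's offset from the "Name" snippet, a simple
    feature that is stable under translation of the document."""
    nx, ny = boxes["Name"]
    return {label: (x - nx, y - ny)
            for label, (x, y) in boxes.items() if label != "Name"}
```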
- The image orientation module 130 provides these features as input to a machine learning model, for example, the neural network 260.
- The neural network 260 is configured to receive an encoding of an image as input and predict one or more values describing the document within the input image, for example, scores indicative of the position of the document within the input image, the orientation of the document within the input image, and so on.
- The neural network 260 is a deep neural network with one or more hidden layers. The hidden layers determine features of the input image that are relevant to predicting the above scores.
- The neural network receives an encoding of an input image that is transformed by layers of artificial neurons, where the inputs for neurons at a given layer come from the previous layer, and all of the outputs of a neuron are provided as input to the subsequent layer.
- The neural network 260 comprises an input component that provides input to a plurality of output components, each output component configured to predict a particular parameter value describing the document in an input image.
- The neural network 260 is a convolutional neural network.
- The neural network is included in the image transformation module 250.
- The training module 270 trains the neural network using, for example, supervised learning techniques based on labelled training data comprising various images and their respective parameter values.
- The training of the neural network 260 is based on a backpropagation process that adjusts the weights of the neural network to minimize an aggregate measure of error between the predicted parameter values and the actual parameter values provided in the training data.
- The training process may be repeated for each image provided in a training dataset.
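- To make the training loop concrete, here is a toy gradient-descent example with a single-weight linear model standing in for the neural network 260. The learning rate, epoch count, and data are illustrative assumptions, but the update rule mirrors backpropagation's error-minimizing weight adjustment.

```python
def train_by_gradient_descent(samples, epochs=200, lr=0.01):
    """samples: (feature, target_parameter_value) pairs from a labelled
    training set. A one-weight model w*x is adjusted to minimize the
    mean squared error between predicted and actual values."""
    w = 0.0
    for _ in range(epochs):
        grad = 0.0
        for x, y in samples:
            pred = w * x
            grad += 2 * (pred - y) * x  # derivative of squared error w.r.t. w
        w -= lr * grad / len(samples)   # step against the gradient
    return w
```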
- The techniques disclosed are applicable to various types of machine learning based models that may or may not be based on deep learning, for example, decision tree based models, random forest based models, logistic regression based models, and so on.
- FIG. 3 is a high-level block diagram illustrating a detailed view of the duplicate user account detection module, according to one embodiment.
- The duplicate user account detection module 140 identifies similar user accounts within the system.
- The various components of the duplicate user account detection module 140 include, but are not limited to, a node store 310, an edge store 320, a connected component store 330, an edge remover module 340, and an edge determination module 350, according to one embodiment. In other embodiments, there may be other components not shown in FIG. 3.
- The duplicate user account detection module 140 maintains a graph comprising nodes representing user accounts and edges representing associations between pairs of nodes that represent similar user accounts.
- The duplicate user account detection module 140 transforms the graph by iteratively modifying it as described herein.
- The node store 310 stores nodes associated with user accounts within the system.
- The user accounts are associated with information and images that may be used to identify a user.
- The edge store 320 stores edges between nodes.
- The edges are associated with similarity scores between nodes, wherein a similarity score indicates the degree of similarity between a pair of nodes. An edge exists between a pair of nodes if the similarity score between the pair of nodes exceeds a threshold value.
- The edge determination module 350 determines the edges between sets of nodes.
- The edge determination module 350 compares the user accounts associated with a set of nodes to determine a similarity score indicating a degree of similarity between the user accounts.
- The edge determination module 350 compares the information of the user accounts to determine duplicate information.
- The information of the user accounts may be entered by a user or may be gathered from the text detected in the user's associated identification document.
- The edge determination module 350 compares the images from identification documents associated with user accounts for similarity, for example, by comparing the locations of pixels within the images. For example, the edge determination module may use facial recognition between images on identification documents to determine whether the identification documents (and therefore the user accounts) represent the same user.
- The edge determination module may convert images of users' faces on identification documents into embeddings (i.e., multi-dimensional vectors describing characteristics of the faces) and use the distance between vectors to determine a degree of similarity between the faces associated with different user accounts.
- A neural network may be used to determine similarity between user accounts. For example, the neural network may be trained on labelled sets of known duplicate user accounts to determine a degree (or percentage) of similarity, represented as a similarity score, between user accounts based on embeddings describing the users' faces. The similarity score is stored in association with the edge.
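- The embedding comparison can be sketched as follows, assuming some face model has already produced fixed-length vectors. Cosine similarity is one common choice of vector distance, used here purely for illustration.

```python
import math

def cosine_similarity(u, v):
    """Similarity score in [-1, 1] between two face embeddings;
    values near 1 indicate highly similar faces."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

- An edge would then be created between two account nodes when this score exceeds the current threshold similarity score.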
- The connected component store 330 stores the nodes and edges that form a connected component. Connected components indicate a high degree of similarity between the nodes they contain.
- The nodes of a connected component are interconnected with multiple other nodes within the connected component, according to some embodiments.
- The node store 310, edge store 320, and connected component store 330 may be combined such that nodes, edges, and connected components are stored together.
- The techniques disclosed herein determine connected components of the graph to identify duplicate user accounts.
- The system may use other techniques for dividing a graph into subgraphs representing duplicate user accounts. For example, some embodiments may use a clustering algorithm to divide a graph into clusters of nodes based on certain criteria, for example, a measure of connectivity between the nodes of the cluster. Each cluster determined by such a process comprises duplicate user accounts.
- The edge remover module 340 removes edges between nodes. In an embodiment, the edge remover module simply associates an edge with a flag indicating that the edge is removed. In other embodiments, the edge remover module 340 deletes the representation of the edge from a data structure representing the graph. The edge remover module 340 determines a threshold similarity score. The edge remover module 340 compares the similarity scores associated with edges to the threshold score and removes edges with a similarity score that indicates a lower degree of similarity than the threshold score. The threshold score may be updated to indicate higher similarity between nodes as the process is performed, removing more edges to find nodes with higher similarity.
- As the threshold score value is updated to indicate higher similarity, two effects follow.
- The number of edges of the graph decreases, since edges indicative of similarity less than the degree of similarity corresponding to the threshold score are removed.
- The number of connected components of the graph increases, and the average size of the connected components decreases.
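- The edge-removal step amounts to a filter over scored edges; the dictionary representation of the graph below is an assumption for illustration.

```python
def prune_edges(scored_edges, threshold):
    """scored_edges maps a (node_a, node_b) pair to its similarity
    score. Edges at or below the raised threshold are dropped, which
    fragments the graph into smaller, more tightly similar components."""
    return {pair: score for pair, score in scored_edges.items()
            if score > threshold}
```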
- FIG. 4 is a flowchart illustrating the process for image transformation from an identification document, according to one embodiment.
- The system gathers 400 an image of an identification document.
- The image is input by a user of the system to show proof and verification of their identity through an identification document.
- The user may enter the image into the system directly from a camera on the client device 110.
- The image is stored in relation to the user's account in the user account store 160.
- The system provides 410 the image of the identification document as input to the neural network 260, which, using the image detection module 230 and the text detection module 240, determines parameters describing the document, for example, the location of the document within the image and the orientation of the identification document in the image.
- The system uses the neural network 260 to determine 420 the position and the orientation of the identification document in the image.
- The neural network 260 is further configured to determine 420 the bounding box and aspect ratio of the image. For example, a certain point of the image may be considered the origin, and the position of the document may be determined as the coordinates of a point of the document, such as a corner.
- The system may represent the orientation of the image using an angle. For example, a certain orientation of the document may be considered a default orientation, and any other orientation may be represented using the angle by which the document needs to be rotated to reach that orientation.
- The system may further represent the dimensions of the document using a scaling factor. For example, a particular size of the document may be considered the default size.
- The system stores a scaling factor indicating the actual size of the document compared to the default size.
- The system transforms 430 the image to change the parameters of the identification document to standardized values.
- The system further extracts areas of interest from the image; for example, the system may extract the portion of the image that shows the document if the image includes objects or patterns in the background other than the document.
- The neural network may be configured to receive an input image of a document and output parameters for transforming the image of the identification document to generate an image displaying the identification document in a canonical form.
- The image is received from a user.
- The system may send the transformed image to the client device 110 of the user associated with the image, or to an administrator, for display.
- The system transforms the image to change the location and orientation, and to fix the distortion, of the identification document, bringing it to a canonical form.
- A particular point of the image is considered the origin, for example, the lower left corner.
- A particular orientation of the identification document is considered the canonical orientation, for example, the orientation in which the image of the person identified is displayed in a position in which the head of a person standing upright would face the viewer of the image.
- In the canonical form, the identification document has edges parallel to the edges of the image.
- The position of the identification document is such that the lower left corner of the identification document is within a threshold of the lower left corner of the image when displayed on a display screen.
- The lower left corner of the identification document may overlap with the lower left corner of the image.
- The identification document has a size that is within a threshold percentage of the size of the image; for example, the dimensions of the identification document are at least 80% of the dimensions of the image.
- The shape of the identification document in the canonical form is rectangular.
- the image transformation module 250 performs geometric transformation of the identification document such that the transformed identification document is in a canonical form, also known as fixing the distortion of the identification document.
- the image transformation module 250 may enlarge the identification document if the identification document in the image is below a threshold size; the image transformation module 250 may move the position of the identification document within the image to bring the identification document to a canonical position; the image transformation module 250 may rotate the identification document in the image to change the orientation of the identification document to a canonical orientation; and if the identification document is not in a rectangular shape, the image transformation module 250 may stretch the identification document such that one side of the document is increased in size more than another side to transform the identification document to a rectangular shape.
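- The corrections listed above compose naturally as geometric transforms. The sketch below is a minimal numpy illustration, not the patent's implementation: it moves the document's corner to the origin, undoes the detected rotation, and rescales, using 3x3 homogeneous transform matrices.

```python
import numpy as np

# A minimal numpy sketch (not the patent's implementation) composing the
# corrections listed above: translate the document's corner to the origin,
# rotate back to the canonical orientation, and scale toward the default
# size, using 3x3 homogeneous transform matrices.
def translation(tx, ty):
    return np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])

def rotation(deg):
    r = np.radians(deg)
    c, s = np.cos(r), np.sin(r)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def scaling(factor):
    return np.diag([factor, factor, 1.0])

def canonicalize(points, corner, angle_deg, scale):
    """Map document points to the canonical pose: move the corner to the
    origin, undo the detected rotation, then rescale."""
    m = scaling(scale) @ rotation(-angle_deg) @ translation(-corner[0], -corner[1])
    pts = np.column_stack([points, np.ones(len(points))])
    return (m @ pts.T).T[:, :2]

# The detected lower-left corner maps exactly onto the image origin.
out = canonicalize(np.array([[100.0, 50.0]]), corner=(100.0, 50.0),
                   angle_deg=30.0, scale=2.0)
assert np.allclose(out, [[0.0, 0.0]])
```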
- FIG. 5 illustrates photo orientation corrections, according to one embodiment.
- the identification document, or document 500 is a driver's license, or “Driver ID.”
- the document 500 may have different orientations, such as document 500 A, document 500 B, and document 500 C.
- Document 500 A shows an embodiment where the document is orientated 90 degrees to the left from the canonical (or standardized) orientation shown in the embodiment of document 500 D.
- Document 500 B shows an embodiment where the document is orientated 180 degrees from document 500 D.
- Document 500 C shows an embodiment where the document is orientated 90 degrees to the right of document 500 D.
- the angle of the orientation that differs from the canonical orientation of document 500 D may be any angle between 0 and 360 degrees.
- After the orientation of the document 500 is detected using the image detection module 230 and the text detection module 240, the image transformation module 250 performs a photo orientation correction, as shown in the figure, to transform the orientation of document 500 A, document 500 B, and document 500 C to the orientation of document 500 D.
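- For the three cases in FIG. 5, the correction reduces to rotating the photo back by a multiple of 90 degrees. The sketch below assumes the detected rotation is such a multiple; a real pipeline would interpolate pixel values for arbitrary angles.

```python
import numpy as np

# A minimal sketch of the photo orientation correction for the cases in
# FIG. 5, assuming the detected rotation is a multiple of 90 degrees; a real
# pipeline would interpolate pixel values for arbitrary angles.
def correct_orientation(image: np.ndarray, detected_angle: int) -> np.ndarray:
    """Rotate the image so the document returns to the canonical orientation.

    detected_angle is how far the document is rotated counter-clockwise
    from the canonical orientation (0, 90, 180, or 270 degrees).
    """
    return np.rot90(image, k=-(detected_angle // 90) % 4)

img = np.arange(6).reshape(2, 3)                 # a tiny stand-in "image"
rotated_left = np.rot90(img, 1)                  # like document 500 A
restored = correct_orientation(rotated_left, detected_angle=90)
assert np.array_equal(restored, img)             # canonical orientation again
```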
- FIG. 6A illustrates a generic photo orientation correction, according to one embodiment.
- the identification document, or document 600 A is a driver's license, or “Driver ID.”
- the document 600 A includes an image 610 A that depicts a user associated with the document 600 A.
- the document 600 A is rotated at an angle 620, θ, of a value between 0 and 360 degrees from a standardized orientation, such as the orientation of document 500 D shown in FIG. 5.
- the image 610 A is also rotated by the same angle 620 from the standardized orientation as the rest of the document 600 A. This information may be used by the image detection module to determine the orientation of the document 600 A once it has determined the location of document 600 A in an image.
- FIG. 6B illustrates transforming the identification document to change the shape of the identification document, according to one embodiment.
- the system detects that the identification document needs a correction based on the shape of the identification document in the image since the identification document is trapezoidal in shape with two unequal parallel sides rather than rectangular with equal parallel sides.
- the system performs the correction by transforming the identification document to stretch the dimensions of the document, thereby generating a rectangular identification document image.
- the system detects the boundary of the document 600 B and performs the correction based on the dimensions of the boundary.
- the identification document, or document 600 B, is a driver's license, or "Driver ID."
- the document 600 B includes an image 610 B that depicts a user associated with the document 600 B.
- In the embodiment depicted in FIG. 6B, the document is rotated into the image, such that rotating the document out, where out is depicted in the direction of the arrows, would transform the document 600 B to the canonical shape (i.e., rectangular shape).
- the image 610 B is also rotated inward by the same amount from the standardized orientation as the rest of the document 600 B. This information is used by the image detection module to determine the orientation of the document 600 B once it has determined the location of document 600 B in an image.
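- The source does not specify how the trapezoid-to-rectangle "stretch" is computed; a standard approach is a homography estimated from the four detected boundary corners via the direct linear transform, sketched here with illustrative corner coordinates.

```python
import numpy as np

# The source does not specify how the trapezoid-to-rectangle "stretch" is
# computed; a standard approach is a homography estimated from the four
# detected boundary corners via the direct linear transform, sketched here.
def homography(src, dst):
    """Solve for H such that dst ~ H @ src for four point correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.array(rows, dtype=float))
    return vt[-1].reshape(3, 3)   # null-space vector gives H up to scale

def apply_h(h, pt):
    p = h @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]

# Trapezoidal document corners mapped onto a canonical rectangle.
src = [(0, 0), (10, 1), (10, 9), (0, 10)]     # two unequal parallel sides
dst = [(0, 0), (10, 0), (10, 10), (0, 10)]    # rectangular target
h = homography(src, dst)
assert np.allclose(apply_h(h, (10, 1)), [10, 0], atol=1e-6)
```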
- FIG. 7 illustrates example images of documents, according to one embodiment.
- the example images are oriented at different example orientations, none of which are exactly the standardized orientation. Though some orientations may appear close to the standardized orientation of document 500 D depicted in FIG. 5 , the user placement of the identification documents in each image is slightly different, and therefore the images may have to be transformed for the identification documents to be in the standardized orientation.
- the background of each image is different, and the system only needs the document itself, not the background, which may be distracting based on patterns and objects included in the background. Therefore, in some embodiments, the image transformation module 250 removes the background of the image to leave only the identification document in the standardized orientation.
- the identification document is stored in association with the user account. Images of identification documents that have been transformed to canonical form can be compared with higher accuracy.
- the system uses image processing techniques for comparing images of faces of people to determine whether the images represent the face of the same person.
- the system uses machine learning based techniques, for example, deep learning based techniques for comparing images of faces of people to determine whether two images show the face of the same user.
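- Deep-learning face comparison typically embeds each face image into a vector and compares embeddings. The sketch below is illustrative: the source names no specific network, and the embedding vectors and decision threshold here are assumptions.

```python
import numpy as np

# An illustrative sketch of deep-learning face comparison: each face image
# is embedded into a vector by a model (not shown; the source names no
# specific network) and embeddings are compared, here by cosine similarity
# with an assumed decision threshold.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(emb_a: np.ndarray, emb_b: np.ndarray,
                threshold: float = 0.8) -> bool:
    """Decide whether two face embeddings likely depict the same user."""
    return cosine_similarity(emb_a, emb_b) >= threshold

e1 = np.array([0.9, 0.1, 0.4])
e2 = np.array([0.85, 0.15, 0.45])   # nearly identical embedding
e3 = np.array([0.1, 0.9, -0.4])     # very different embedding
assert same_person(e1, e2)
assert not same_person(e1, e3)
```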
- the system uses image comparison as well as comparison of user account information to determine whether two user accounts belong to the same user. This allows the system to identify duplicate user accounts and take appropriate user actions, for example, sending a message to the user to consolidate the user accounts or to disable at least some of the user accounts.
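- Combining image comparison with account-information comparison yields one similarity score per account pair. The source states both are used but gives no formula; in the hedged sketch below, the weights and the string-similarity measure are assumptions.

```python
from difflib import SequenceMatcher

# A hedged sketch of combining account-information similarity with image
# similarity into one score. The source states both are used but gives no
# formula; the weights and the string-similarity measure are assumptions.
def field_similarity(a: str, b: str) -> float:
    """Ratio of matching characters between two strings, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def account_similarity(acct1: dict, acct2: dict, image_sim: float) -> float:
    """Blend name, address, and face-image similarity (weights illustrative)."""
    name_sim = field_similarity(acct1["name"], acct2["name"])
    addr_sim = field_similarity(acct1["address"], acct2["address"])
    return 0.3 * name_sim + 0.2 * addr_sim + 0.5 * image_sim

a1 = {"name": "Jane Doe", "address": "12 Oak St"}
a2 = {"name": "Jane  Doe", "address": "12 Oak Street"}  # near-duplicate data
score = account_similarity(a1, a2, image_sim=0.95)
assert score > 0.8  # near-duplicate accounts score high
```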
- Users may create multiple user accounts to bypass certain checks performed by the system based on policies. For example, if a user account is flagged as violating certain policy enforced by the system, the user may create an alternate account. Similarly, if the system enforces certain quota per user, a user may create multiple user accounts to game the system, thereby exceeding the quota.
- Embodiments of the invention detect duplicate user accounts to ensure that each user has a single user account, thereby enforcing the policies strictly.
- FIG. 8 is a flowchart illustrating the process for identifying duplicate user accounts, according to one embodiment.
- the system receives 800 a plurality of user accounts and, for each of a plurality of pairs of user accounts, determines 805 a similarity score indicative of similarity between a first user account and a second user account in the pair.
- the system determines 810 an initial threshold similarity score that is indicative of a particular degree of similarity between user accounts. This initial threshold similarity score is used to determine which user accounts are similar to one another and which user accounts are not similar to one another.
- the system repeats the following steps 815 and 820 for a plurality of iterations to determine connected components of user accounts, where connected nodes in a graph represent similar user accounts. Each iteration of the steps has a threshold similarity score, which is initialized to the initial threshold similarity score.
- the system determines 815 one or more connected components in the graph of nodes and edges.
- the nodes represent user accounts, and a pair of nodes has an edge if the similarity score of the pair of nodes indicates a greater degree of similarity than that indicated by the threshold similarity score.
- a greater number for a similarity score may indicate a greater degree of similarity.
- a smaller number for a similarity score may indicate a greater degree of similarity.
- the system modifies 820 the threshold similarity score for the next iteration to a value indicative of a higher degree of similarity between user accounts compared with the threshold similarity score for the current iteration.
- the system removes edges with a similarity score with a degree of similarity less than the modified threshold similarity score.
- the initial connected components include user accounts that may not be very similar but as the iterations proceed the user accounts in each connected component are more likely to be similar and represent duplicate user accounts.
- the system repeats the steps 815 and 820 until the user accounts are within a certain degree of similarity. This may be determined by the size of connected components or the number of connected components or the value of the similarity score. For example, in some embodiments, the system repeats the steps 815 and 820 until the system can no longer remove edges from the connected components due to a high degree of similarity.
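- The iterative process of steps 815 and 820 can be sketched as follows: build a graph whose edges are account pairs scoring above the threshold, find connected components, then raise the threshold and repeat. In this sketch a larger score means greater similarity; the step size and the stopping rule (stop once every remaining edge already clears the next threshold) are illustrative choices, not specified by the source.

```python
from collections import defaultdict

# A sketch of the iterative process of FIG. 8 (steps 815 and 820). The step
# size and stopping rule are illustrative choices, not from the source.
def connected_components(nodes, edges):
    """Depth-first search over an adjacency map built from the edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def find_duplicates(nodes, scores, initial_threshold, step=0.1):
    threshold = initial_threshold
    while True:
        edges = [pair for pair, s in scores.items() if s > threshold]
        comps = connected_components(nodes, edges)
        next_threshold = threshold + step
        if all(scores[pair] > next_threshold for pair in edges):
            # No edge would be removed next iteration; report multi-node
            # components as likely duplicate account sets.
            return [c for c in comps if len(c) > 1]
        threshold = next_threshold

accounts = ["u1", "u2", "u3", "u4"]
scores = {("u1", "u2"): 0.9, ("u2", "u3"): 0.35, ("u3", "u4"): 0.92}
groups = find_duplicates(accounts, scores, initial_threshold=0.3)
assert sorted(sorted(c) for c in groups) == [["u1", "u2"], ["u3", "u4"]]
```

Raising the threshold drops the weak u2-u3 edge, splitting one loose component into two tight ones, which mirrors the behavior described above.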
- Responsive to repeating the steps for a plurality of iterations, the system identifies 825 one or more connected components, where each connected component represents a set of user accounts for a particular user.
- the system stops the iterations based on certain criteria. For example, in an embodiment, the system repeats the process for a fixed number of iterations. In another embodiment, the system stops the iterations if there are no changes in the connected components between subsequent iterations or if the changes in the connected components are below a threshold amount between iterations. In some embodiments, the system stops the iterations if the number of connected components exceeds a threshold.
- the system determines whether an aggregate measure based on sizes of connected components reaches below a threshold, thereby indicating that the connected components are dense (i.e., most, if not all, nodes in the connected component are connected to one another).
- the system transmits 830 information describing the identified one or more connected components to a privileged user, for example, an analyst for verification.
- the system has connected components of user accounts that represent duplicate accounts with a high likelihood.
- the system uses a connection ratio threshold to determine whether to alter the threshold similarity score.
- the connection ratio threshold represents how dense a connected component is (i.e. the number of edges per number of nodes that must exist within a connected component for the system to indicate that the connected component likely contains duplicate user accounts).
- the connection ratio threshold may be specified by a user, for example, as a system configuration parameter specified by a system administrator.
- the system may analyze previous results to estimate a connection ratio threshold. For example, the system identifies various connected components determined during previous executions of the process illustrated in FIG. 8 . The system marks the nodes that were determined to represent duplicate user accounts at the end of execution of the process.
- For each connected component, the system identifies the number of edges in the connected component and determines whether the connected component contains duplicate user accounts. The system determines an aggregate measure of the number of edges of connected components that contain duplicate user accounts.
- the connection ratio threshold is a value determined as an aggregate measure of the ratios of the number of edges of connected components containing duplicate user accounts to the size of the connected component, as indicated by the number of nodes of the connected component.
- the system saves connected components with more edges as determined using the connection ratio threshold. These connected components may also be referred to as dense connected components, which contain nodes of likely duplicate user accounts.
- the system determines whether a connected component is sparse, i.e., whether it has a small number of edges compared to its size, by comparing the ratio of the number of edges of the connected component to the number of nodes of the connected component with the connection ratio threshold. If the system determines that this ratio is smaller than the connection ratio threshold, the system modifies the threshold similarity score representing the degree of similarity of connected components to break the connected components into smaller, denser connected components.
- a connected component with 5 nodes would have 10 edges if the connected component is fully-connected.
- a connection ratio of 7 may indicate that the connected component must be connected by at least 70% of the maximum number of edges. Accordingly, the system saves dense connected components with 7 or more edges and alters the threshold similarity score to remove some edges to divide a sparse connected component into smaller, denser connected components.
- the system calculates a connectivity ratio for each connected component and compares the connectivity ratio to the connection ratio threshold.
- the connectivity ratio may be a relationship between the number of edges, k, and nodes, n, in a connected component, as shown in Equation 1.
- the connectivity ratio may be represented by Equation 2.
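- Equations 1 and 2 are not reproduced in this text, so the form below is an assumption: one plausible connectivity ratio divides the number of edges k by the maximum possible number of edges, n*(n-1)/2, for n nodes.

```python
# Equations 1 and 2 are not reproduced in this text, so the form below is an
# assumption: one plausible connectivity ratio divides the number of edges k
# by the maximum possible number of edges, n*(n-1)/2, for n nodes.
def connectivity_ratio(k: int, n: int) -> float:
    """Fraction of possible edges actually present in a connected component."""
    return k / (n * (n - 1) / 2)

def is_dense(k: int, n: int, connection_ratio_threshold: float) -> bool:
    """Compare a component's connectivity against the configured threshold."""
    return connectivity_ratio(k, n) >= connection_ratio_threshold

# 5 nodes fully connected have 10 edges; 7 edges give a ratio of 0.7,
# matching the "at least 70% of the maximum number of edges" example above.
assert connectivity_ratio(10, 5) == 1.0
assert is_dense(7, 5, 0.7)
assert not is_dense(6, 5, 0.7)
```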
- the system modifies the threshold similarity score representing the degree of similarity of connected components to break the connected components into smaller, denser connected components.
- the system stores the connected component information by associating user accounts belonging to the same connected component.
- the system sends messages to users determined to have duplicate user accounts requesting the users to consolidate the user accounts or delete additional user accounts.
- the system disables one or more user accounts from a connected component. For example, the system identifies the oldest user account and keeps it active and disables all the remaining user accounts in the connected component.
- the system identifies the user account that is associated with the most level of activity and disables the remaining user accounts. The system may disable a user account by preventing the user from using the account unless the user provides additional authentication information or calls and talks to a customer service representative to provide authentication information.
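- One remediation policy described above keeps the oldest account in a connected component active and disables the rest. The sketch below illustrates that policy; the account fields used here are hypothetical.

```python
from datetime import date

# An illustrative sketch of one remediation policy described above: within a
# connected component of likely duplicates, keep the oldest account active
# and disable the rest. The account fields used here are hypothetical.
def disable_duplicates(component):
    """Return (kept, disabled) for accounts carrying a creation date."""
    kept = min(component, key=lambda acct: acct["created"])
    disabled = [acct for acct in component if acct is not kept]
    for acct in disabled:
        acct["active"] = False   # e.g., pending extra authentication
    return kept, disabled

component = [
    {"id": "u7", "created": date(2019, 5, 1), "active": True},
    {"id": "u2", "created": date(2017, 3, 9), "active": True},
    {"id": "u9", "created": date(2020, 1, 15), "active": True},
]
kept, disabled = disable_duplicates(component)
assert kept["id"] == "u2"                        # oldest account stays active
assert all(not acct["active"] for acct in disabled)
```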
- the user account store 160 maintains a table storing relations between user accounts that have been verified as belonging to distinct users. Each user account may have a user account identifier that uniquely identifies the user account and the table stores pairs of user account identifiers for user accounts that are verified as belonging to distinct users. Accordingly, an edge between two user accounts is removed (or never created when the graph is initialized) if the two user accounts have been previously verified as being distinct user accounts.
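- The verified-distinct table described above suppresses edges between account pairs already confirmed to belong to different users. In the sketch below, identifiers and the data layout are illustrative; pairs are stored order-independently so either ordering of a pair matches.

```python
# A sketch of the check against the verified-distinct table described above:
# an edge between two accounts is never created when the pair was previously
# verified as belonging to distinct users. Identifiers and the data layout
# are illustrative; pairs are stored order-independently.
def canonical_pair(a: str, b: str) -> tuple:
    return (a, b) if a <= b else (b, a)

def build_edges(scored_pairs, threshold, verified_distinct):
    """Keep pairs above the similarity threshold, excluding verified pairs."""
    return [
        canonical_pair(a, b)
        for (a, b), score in scored_pairs.items()
        if score > threshold and canonical_pair(a, b) not in verified_distinct
    ]

scores = {("u1", "u2"): 0.9, ("u3", "u1"): 0.85}
verified = {canonical_pair("u2", "u1")}   # u1 and u2 are known distinct users
edges = build_edges(scores, threshold=0.5, verified_distinct=verified)
assert edges == [("u1", "u3")]
```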
- FIG. 8 illustrates a number of interactions according to one embodiment
- the precise interactions and/or order of interactions may vary in different embodiments.
- the steps may only be repeated once for a particular threshold similarity score, according to some embodiments.
- the steps may be repeated until a threshold level of connected components of user accounts have been formed or some other threshold condition has been met.
- FIG. 9A illustrates examples of connected components of user accounts, according to one embodiment.
- a fully-connected connected component 900 has nodes that are connected to all other nodes in the connected component by edges. This type of connected component indicates that all of the user accounts associated with the nodes are within some degree of similarity to one another (i.e., directly connected to one another).
- Connected components 910, 920, 930 are examples of low-quality connected components that need to be regenerated, since not every pair of nodes is directly connected within the required degree of similarity.
- a star shape connected component 910 has a plurality of nodes all connected to one center node.
- a chain shape connected component 920 has a plurality of nodes connected in chains. This type of connected component indicates that nodes are connected to one another within a degree of similarity but are not all similar to one another within that degree. In some embodiments, each node in a chain shape connected component 920 is connected to only two other nodes. In other embodiments, some nodes in the connected component may be connected to more than two nodes, but some nodes in the connected component are connected to at most two nodes.
- a connected component of sub-components 930 connects nodes from separate connected components into one connected component. This indicates that the connected components are similar in some way. The connected components are connected by inside nodes, which are the nodes that connect the connected components to one another. In some embodiments, there may be more than one pair of inside nodes connecting the connected components.
- FIG. 9B illustrates an example of forming connected components after modifying the threshold value for the threshold similarity scores.
- a larger number indicates a greater degree of similarity.
- the example in FIG. 9B depicts a connected component of sub-components 940 A, wherein the edges in the connected component have a similarity score greater than 0.23.
- when the threshold similarity score is raised, edges are removed from the connected component of sub-components 940 A. This results in the removal of an inside edge that was connecting two connected components, connected component 940 B and connected component 940 C, together.
- Connected component 940 B and connected component 940 C have nodes with a higher degree of similarity than the nodes in the connected component of sub-components 940 A.
- FIG. 10A illustrates an example of a set of images from documents at one threshold value, according to one embodiment.
- a larger number indicates a greater degree of similarity.
- Each image in connected component 1000 A represents a node and depicts a user associated with a different user account.
- the degree of similarity between the images is 0.23214.
- FIG. 10B illustrates an example of subsets of the set of images from FIG. 10A, the subsets of the set of images having a greater degree of similarity than the set of images from FIG. 10A, according to one embodiment.
- the degree of similarity between the images in connected component 1000 B is 0.5.
- the degree of similarity between the images in connected component 1000 C is 0.5.
- the images in connected component 1000 B appear to depict the same user, indicating that the user has signed up for five accounts to circumvent the rules of the system or allowed other users to use their identification document, according to some embodiments. The same is true for connected component 1000 C, but only two user accounts have been made with that user's image from their identification document.
- FIG. 11 is a high-level block diagram illustrating physical components of a computer used as part or all of the client device from FIG. 1 , according to one embodiment. Illustrated are at least one processor 1102 coupled to a chipset 1104 . Also coupled to the chipset 1104 are a memory 1106 , a storage device 1108 , a graphics adapter 1112 , and a network adapter 1116 . A display 1118 is coupled to the graphics adapter 1112 . In one embodiment, the functionality of the chipset 1104 is provided by a memory controller hub 1120 and an I/O controller hub 1122 . In another embodiment, the memory 1106 is coupled directly to the processor 1102 instead of the chipset 1104 .
- the storage device 1108 is any non-transitory computer-readable storage medium, such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device.
- the memory 1106 holds instructions and data used by the processor 1102 .
- the graphics adapter 1112 displays images and other information on the display 1118 .
- the network adapter 1116 couples the computer 1100 to a local or wide area network.
- a computer 1100 can have different and/or other components than those shown in FIG. 11.
- the computer 1100 can lack certain illustrated components.
- a computer 1100 acting as a server may lack a graphics adapter 1112 , and/or display 1118 , as well as a keyboard or pointing device.
- the storage device 1108 can be local and/or remote from the computer 1100 (such as embodied within a storage area network (SAN)).
- the computer 1100 is adapted to execute computer program modules for providing functionality described herein.
- module refers to computer program logic utilized to provide the specified functionality.
- a module can be implemented in hardware, firmware, and/or software.
- program modules are stored on the storage device 1108 , loaded into the memory 1106 , and executed by the processor 1102 .
- Embodiments of the entities described herein can include other and/or different modules than the ones described here.
- the functionality attributed to the modules can be performed by other or different modules in other embodiments.
- this description occasionally omits the term “module” for purposes of clarity and convenience.
- Certain aspects of the present invention include process steps and instructions described herein in the form of an algorithm. It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
- the present invention also relates to an apparatus for performing the operations herein.
- This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer.
- a computer program may be stored in a non-transitory computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of computer-readable storage medium suitable for storing electronic instructions, and each coupled to a computer system bus.
- the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
- the present invention is well suited to a wide variety of computer network systems over numerous topologies.
- the configuration and management of large networks comprise storage devices and computers that are communicatively coupled to dissimilar computers and storage devices over a network, such as the Internet.
Description
- The present invention generally relates to the field of identifying fraudulent user accounts in a system, and more specifically, to detection of duplicate user accounts using deep learning.
- User accounts for various applications are primarily based on information input from associated users. Conventionally, when a user makes an account for a system, the system may request the user to enter personal information, such as name, address, or phone number, according to an embodiment. In another embodiment, the user may be required to upload an image of their identification document, used by the system to verify their identity. The system can then use the identification document to ensure that no duplicate user accounts are created for one user.
- However, there are several flaws in this method that prevent the system from properly identifying all users. In some instances, users may use fake identification documents with slightly different information or images to create multiple accounts, and the system may not recognize the accounts as duplicate. Alternatively, a user banned from the system may make another account with slightly different information but a similar identification document. In addition, the image of an identification document is often taken by the user, for example, as a picture from a phone, tablet, or digital camera, sometimes at different orientations. As a result, this may make images within an identification document appear slightly altered from their actual appearance or have other issues with quality, and the system may be unable to identify duplicate images associated with various user accounts.
- Systems and methods are disclosed herein for identifying duplicate user accounts. These systems and methods may be applied using user identifying information, such as name, address, or an image of a user, which may be preprocessed to fix any errors with the image (i.e., distortion and/or orientation). The system creates a graph representing the user accounts as nodes connected by edges representing similarity scores between each pair of nodes and iteratively removes edges representing similarity scores below a threshold amount until only connected components of nodes for similar user accounts remain.
- In some embodiments, the system receives a plurality of user accounts and determines similarity scores indicating the similarity between each pair of user accounts in the plurality of user accounts. The system determines an initial threshold similarity score indicative of a particular degree of similarity between user accounts. The system repeats a set of steps for a plurality of iterations where each iteration has a threshold similarity score and the threshold similarity score is initialized to the initial threshold similarity score. The steps include determining one or more connected components of a graph of nodes and edges where each node represents a user account and a pair of nodes has an edge if the similarity score of the pair of nodes indicates a greater degree of similarity than indicated by the threshold similarity score and modifying the threshold similarity score for the next iteration to a value indicative of a higher degree of similarity between user accounts compared to the threshold similarity score for the current iteration. The system repeats the set of steps for the plurality of iterations. For example, the system may repeat the process for a fixed number of iterations. Alternatively, the system may repeat the process until two subsequent iterations result in the same set of connected components. Alternatively, the system may repeat the process until an aggregate measure of sizes of the connected components indicates that the connected components are small and have an aggregate size below a threshold value. Responsive to repeating the steps in the set for the plurality of iterations, the system identifies one or more connected components each representing a set of user accounts for a particular user. The system transmits information describing the identified one or more connected components to a client device, for example, to a privileged user.
In some embodiments, the system may remove one or more edges with a similarity score indicative of a degree of similarity less than the modified threshold similarity score.
-
FIG. 1 illustrates a computing environment for processing documents, according to one embodiment. -
FIG. 2 is a high-level block diagram illustrating a detailed view of the image orientation module, according to one embodiment. -
FIG. 3 is a high-level block diagram illustrating a detailed view of the duplicate user account detection module, according to one embodiment. -
FIG. 4 is a flowchart illustrating the process for image transformation from an identification document, according to one embodiment. -
FIG. 5 illustrates photo orientation corrections, according to one embodiment. -
FIG. 6A illustrates a generic photo orientation correction, according to one embodiment. -
FIG. 6B illustrates transforming the identification document to change the shape of the identification document, according to one embodiment. -
FIG. 7 illustrates example images of identification documents, according to one embodiment. -
FIG. 8 is a flowchart illustrating the process for identifying duplicate user accounts, according to one embodiment. -
FIG. 9A illustrates examples of connected components of user accounts, according to one embodiment. -
FIG. 9B illustrates an example of forming connected components after modifying the threshold value for the threshold similarity scores. -
FIG. 10A illustrates an example of a set of images from identification documents at one threshold value, according to one embodiment. -
FIG. 10B illustrates an example of subsets of the set of images from FIG. 10A, the subsets of the set of images having a greater degree of similarity than the set of images from FIG. 10A, according to one embodiment. -
FIG. 11 is a high-level block diagram illustrating physical components of a computer used as part or all of the client device from FIG. 1, according to one embodiment. - The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
- This application details a technique for determining duplicate user accounts within a system. User accounts may include user identifying information, such as name, address, or an image of an identification document. Examples of identification documents include driver's licenses, passports, or any other government-issued identification document. In some embodiments, the system employs a preprocessing technique to determine similarities between document images from separate user accounts. The system determines degrees of similarity between user accounts and may then employ a clustering algorithm to reduce the rate of false positives in similarity matches, improving the precision of the system in identifying duplicate user accounts.
-
FIG. 1 illustrates a computing environment for processing documents, according to one embodiment. The computing environment, or system, includes a client device 110, network 120, and server 100, according to an embodiment. These various components are now described in additional detail. - The
client device 110 is a computing device such as a smart phone, laptop computer, desktop computer, or any other device that can access the network 120 and, subsequently, the server 100. In the embodiment of FIG. 1, there is one client device. In other embodiments, there may be a plurality of client devices. - The
network 120 may be any suitable communications network for data transmission. In an embodiment such as that illustrated in FIG. 1, the network 120 uses standard communications technologies and/or protocols and can include the Internet. In another embodiment, the entities use custom and/or dedicated data communications technologies. - The
network 120 connects the client device 110 to the server 100. The server 100 comprises an image orientation module 130, a user account store 160, and a duplicate user account detection module 140. In other embodiments, there may be other modules included in the server. The server 100 processes identification documents (i.e., "documents," for simplicity). The image orientation module 130 detects the orientation of a document in a given image, for example, the position at which the document is placed and the orientation and distortion of the document in the image. The duplicate user account detection module 140 detects whether two user accounts are duplicates, i.e., user accounts of the same user. The user account store 160 stores user accounts within the system. A user account is associated with a user of the system and contains identifying information such as name, address, phone number, and images of identification documents, according to an embodiment. In other embodiments, user accounts may include more or less information about users associated with the accounts. -
FIG. 1 shows one possible configuration of the system. In other embodiments, there may be more or fewer systems or components; for example, there may be multiple servers 100, or the server 100 may be composed of multiple systems such as individual servers or load balancers. -
FIG. 2 is a high-level block diagram illustrating a detailed view of the image orientation module, according to one embodiment. The image orientation module 130 receives an image of a document and determines the orientation and distortion of the document in the given image, for example, the orientation of an identification document in an image submitted to the server 100 for evaluation. The orientation of the document may include the angle of rotation, the direction of rotation, and the position of the document in the image. The various components of the image orientation module 130 include, but are not limited to, a document store 210, an image transformation module 250, an image detection module 230, a text detection module 240, a neural network 260, and a training module 270, according to one embodiment. In other embodiments, there may be other components not shown in FIG. 2. - The
document store 210 stores images of documents, for example, electronic copies of physical identification documents associated with a user. Examples of identification documents include passports, driver's licenses, national or state IDs, and so on. Identification documents may be used to verify the identity of a user. A user uploads an image of the user's identification document to the system. Typically, an identification document includes an image and text. The text may include information describing the user, for example, the user's name, the user's address, the user's date of birth, the date the document was issued, the date of expiry of the document, and an identification number. The image in the document is typically an image of the user, for example, a photo of the user. - The
image transformation module 250 processes the image of an identification document associated with a user. The image transformation module 250 detects the identification document within the image and transforms the image to reorient and/or scale the identification document within the image, according to an embodiment. In various embodiments, an identification document may be distorted or rotated relative to the orientation of the image itself. These orientations are further described in FIG. 6A and FIG. 6B. - The
image detection module 230 receives, for processing, an input image comprising a document and detects one or more images within the bounds of the document. In an embodiment, an image detected within the document is of a user identified by the identification document, for example, an image of the face of the user. The image detected within the document may be stored in relation to the user and user account in the user account store 160. In other embodiments, the image detection module 230 detects other images within the document and stores them in the user account store 160. Examples of possible images that are detected include images of the user's signature, the background of the identification document, and the shape of a geographical region associated with the identification document, such as a state or province. - The
image detection module 230 also records the location of an image in a document, the size of the image, the orientation of the image, the relative positions of two images, the relative position of an image and a text snippet, and other parameters describing the image or images within the document. In some embodiments, the image detection module 230 uses these features of the image to determine the location of the document in the image, so that the document may be transformed. For example, certain types of identification documents have the user's image at a particular location within the document, having a particular size with respect to the size of the document and a particular orientation with respect to the identification document or with respect to one or more text fields or text snippets present in the document. In an embodiment, the image detection module 230 uses these features to detect the type of identification document and the correct orientation of the identification document. Accordingly, the duplicate user account detection module 140 extracts the features describing the images and provides them as input to a machine learning model. - The machine learning model is trained using a training data set comprising images of various identification documents. The machine learning model is configured to determine scores indicative of the parameters describing the document within an input image, for example, scores indicative of the position of the document within the image, the orientation of the document, and so on. In some embodiments, the
neural network 260 may automatically infer these parameters and use them for detecting the type of the identification document, the orientation of the identification document, and other types of information describing the document. In some embodiments, the machine learning model determines a boundary of the input image and uses the boundaries to determine the parameters. - The
text detection module 240 detects text within the transformed images of identification documents. Text may include, according to some embodiments, a user's name, address, identification number, or other identification information. In other embodiments, the text detection module 240 detects text boundaries instead of or in conjunction with text. The detected text may be stored in the user account store 160 in relation to the associated user and user account. In some embodiments, the text from the transformed images is compared to information contained within the user account to verify the validity of the information. The text detection module 240 also records the location of text in a document, which may be used to determine the location and the orientation of the document within the image. - In an embodiment, the
text detection module 240 performs optical character recognition (OCR) to recognize certain snippets of the text, for example, "Name", "Date of Birth", "Address", and so on. The image detection module 230 generates features based on positions of these text snippets within the document, for example, relative to a particular corner of the document. The image detection module 230 generates features based on relative positions of various text snippets, for example, the position of the "Address" snippet with respect to the "Name" snippet, and so on. The image detection module 230 generates features based on relative positions of images within the document compared to various text snippets, for example, the position of the image of the user with respect to the "Name" snippet, or the position of a logo in the identification document with respect to the "Address" snippet, and so on. The image orientation module 130 provides these features as input to a machine learning model, for example, the neural network 260. - The
neural network 260 is configured to receive an encoding of an image as input and predict one or more values describing the document within the input image, for example, scores indicative of the position of the document within the input image, the orientation of the document within the input image, and so on. In an embodiment, the neural network 260 is a deep neural network with one or more hidden layers. The hidden layers determine features of the input image that are relevant to predicting the above scores. In this embodiment, the neural network receives an encoding of an input image that is transformed by layers of artificial neurons, where the inputs for neurons at a given layer come from the previous layer, and all of the outputs of a neuron are provided as input to the subsequent layer. In an embodiment, the neural network 260 comprises an input component that provides input to a plurality of output components, each output component configured to predict a particular parameter value describing the document in an input image. In an embodiment, the neural network 260 is a convolutional neural network. In some embodiments, the neural network is included in the image transformation module 250. - The
training module 270 trains the neural network using, for example, supervised learning techniques based on labelled training data comprising various images and their respective parameter values. The training of the neural network 260 is based on a backpropagation process that adjusts the weights of the neural network to minimize an aggregate measure of error between predicted parameter values and actual parameter values provided in the training data. The training process may be repeated for each image provided in a training dataset. Although several embodiments described herein are based on neural networks, the techniques disclosed are applicable to various types of machine learning based models that may or may not be based on deep learning, for example, decision tree based models, random forest based models, logistic regression based models, and so on. -
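The predict, measure error, and adjust weights loop described above can be shown in miniature. The sketch below is illustrative only: the patent's model is a deep (e.g., convolutional) neural network over image encodings, while this example trains a single logistic unit by gradient descent on toy one-dimensional data; all names and values are assumptions.

```python
import math

def train(samples, epochs=500, lr=0.5):
    """Train a single logistic unit; samples are (feature_vector, label) pairs
    with labels 0 or 1. Weights are adjusted to reduce prediction error."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted score
            err = p - y                          # gradient of log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Predicted score in (0, 1) for a feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: the label is 1 exactly when the single feature is positive.
data = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
w, b = train(data)
```

The same loop scales to many weights and layers; backpropagation is the mechanism for computing the per-weight error gradients in the multi-layer case.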
FIG. 3 is a high-level block diagram illustrating a detailed view of the duplicate user account detection module, according to one embodiment. The duplicate user account detection module 140 identifies similar user accounts within the system. The various components of the duplicate user account detection module 140 include, but are not limited to, a node store 310, an edge store 320, a connected component store 330, an edge remover module 340, and an edge determination module 350, according to one embodiment. In other embodiments, there may be other components not shown in FIG. 3. The duplicate user account detection module 140 maintains a graph comprising nodes representing user accounts and edges representing associations between pairs of nodes that represent similar user accounts. The duplicate user account detection module 140 transforms the graph by iteratively modifying the graph as described herein. - The
node store 310 stores nodes associated with user accounts within the system. The user accounts are associated with information and images that may be used to identify a user. The edge store 320 stores edges between nodes. The edges are associated with similarity scores between nodes, wherein a similarity score indicates the degree of similarity between a pair of nodes. An edge exists between a pair of nodes if the similarity score between the pair of nodes exceeds a threshold value. - The
edge determination module 350 determines the edges between pairs of nodes. The edge determination module 350 compares the user accounts associated with a pair of nodes to determine a similarity score indicating a degree of similarity between the user accounts. In some embodiments, the edge determination module 350 compares the information of the user accounts to determine duplicate information. The information of the user accounts may be entered by a user or may be gathered from the text detected from the user's associated identification document. Further, in some embodiments, the edge determination module 350 compares the images from identification documents associated with user accounts for similarity by comparing the location of pixels within the images. For example, the edge determination module may use facial recognition between images on identification documents to determine if the identification documents (and therefore user accounts) represent the same user. The edge determination module may convert images of users' faces on identification documents into embeddings (i.e., multi-dimensional vectors describing characteristics of the faces) and use the distance between vectors to determine a degree of similarity between the faces of users of different user accounts. In some embodiments, a neural network may be used to determine similarity between user accounts. For example, the neural network may be trained on labelled sets of known duplicate user accounts to determine a degree (or percentage) of similarity, represented as a similarity score, between user accounts based on embeddings describing the users' faces. The similarity score is stored in association with the edge. - The connected
component store 330 stores the nodes and edges that form a connected component. Connected components indicate a high degree of similarity between the nodes in the connected components. The nodes of a connected component are interconnected to multiple other nodes within the connected component, according to some embodiments. In some embodiments, the node store 310, edge store 320, and connected component store 330 are combined such that nodes, edges, and connected components are stored together. Although the techniques disclosed herein determine connected components of the graph to identify duplicate user accounts, the system may use other techniques for dividing a graph into subgraphs representing duplicate user accounts. For example, some embodiments may use a clustering algorithm to divide a graph into clusters of nodes based on certain criteria, for example, a measure of connectivity between nodes of the cluster. Each cluster determined by such a process comprises duplicate user accounts. - The edge remover
module 340 removes edges between nodes. In an embodiment, the edge remover module simply associates an edge with a flag indicating that the edge is removed. In other embodiments, the edge remover module 340 deletes a representation of the edge from a data structure representing the graph. The edge remover module 340 determines a threshold similarity score. The edge remover module 340 compares the similarity scores associated with edges to the threshold score and removes edges with a similarity score that indicates a lower degree of similarity than the threshold score. The threshold score may be updated to indicate higher similarity between nodes as the process is performed, removing more edges to find nodes with higher similarity. As the threshold score value is updated to indicate higher similarity, the number of edges of the graph decreases, since edges indicative of similarity less than the degree of similarity corresponding to the threshold score are removed. As a result, the number of connected components of the graph increases and the average size of connected components decreases. -
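The edge remover module's thresholding step can be sketched minimally, assuming edges are held in a dictionary keyed by node pairs and that a higher score indicates greater similarity (all names and values here are illustrative):

```python
# Remove edges whose similarity score falls below the current threshold.
# Here a higher score means more similar; other embodiments may invert this.

def remove_weak_edges(edges, threshold):
    """edges: dict mapping (node_a, node_b) pairs to similarity scores.
    Returns a new dict keeping only edges at or above the threshold."""
    return {pair: score for pair, score in edges.items() if score >= threshold}

edges = {("u1", "u2"): 0.92, ("u2", "u3"): 0.61, ("u4", "u5"): 0.85}
kept = remove_weak_edges(edges, 0.8)  # the u2-u3 edge is removed
```

Raising the threshold and calling this again on `kept` mirrors the iterative removal described above: fewer edges survive each pass, so connected components shrink and multiply.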
FIG. 4 is a flowchart illustrating the process for image transformation from an identification document, according to one embodiment. The system gathers 400 an image of an identification document. The image is input by a user of the system to show proof and verification of their identity through an identification document. In some embodiments, the user may enter the image into the system directly from a camera on the client device 110. The image is stored in relation to a user account of the user in the user account store 160. Once the image has been entered into the system by the user, the system provides 410 the image of the identification document as input to the neural network 260, which, using the image detection module 230 and the text detection module 240, determines parameters describing the document, for example, the location of the document within the image and the orientation of the identification document in the image. - The system uses the
neural network 260 to determine 420 the position and the orientation of the identification document in the image. In an embodiment, the neural network 260 is further configured to determine 420 the bounding box and aspect ratio of the image. For example, a certain point of the image may be considered the origin, and the position of the document may be determined as the coordinates of a point of the document such as a corner. The system may represent the orientation of the image using an angle. For example, a certain orientation of the document may be considered a default orientation, and any other orientation may be represented using the angle by which the document needs to be rotated to reach that orientation. The system may further represent dimensions of the document using a scaling factor. For example, a particular size of the document may be considered the default size. If the image is captured such that the document is much smaller than the default size, the system stores a scaling factor indicating the actual size of the document compared to the default size. Once the parameters of the identification document, including the position, orientation, and dimensions, have been confirmed, the system transforms 430 the images to change the parameters of the identification document to standardized values. In an embodiment, the system further extracts areas of interest from the image; for example, the system may extract a portion of the image that shows the document if the image includes objects or patterns in the background other than the document. - Although
FIG. 4 illustrates a number of interactions according to one embodiment, the precise interactions and/or order of interactions may vary in different embodiments. For example, the neural network may be configured to receive an input image of a document and output parameters for transforming the image of the identification document to generate an image displaying the identification document in a canonical form. In some embodiments, the image is received from a user. Further, upon transforming the image, the system may send the transformed image to the client device 110 of the user associated with the image, or to an administrator, for display. - The system transforms the images to change the location and orientation and fix the distortion of the identification document to a canonical form. For example, a particular point of the image is considered an origin, for example, the lower left corner. A particular orientation of the identification document is considered a canonical orientation, for example, the orientation in which the image of the person identified is displayed in a position in which the head of a person standing upright would face the viewer of the image. Furthermore, in a canonical orientation, the identification document has edges parallel to the edges of the image. In the canonical orientation, the position of the identification document is such that the lower left corner of the identification document is within a threshold of the lower left corner of the image when displayed on a display screen. For example, the lower left corner of the identification document may overlap with the lower left corner of the image. In the canonical form, the identification document has a size that is within a threshold percentage of the size of the image, for example, the dimensions of the identification document are at least 80% of the dimensions of the image. Furthermore, the shape of the identification document in a canonical form is rectangular. The
image transformation module 250 performs a geometric transformation of the identification document such that the transformed identification document is in a canonical form, also known as fixing the distortion of the identification document. Accordingly, the image transformation module 250 may enlarge the identification document if the identification document in the image is below a threshold size; the image transformation module 250 may move the position of the identification document within the image to bring the identification document to a canonical position; the image transformation module 250 may rotate the identification document in the image to change the orientation of the identification document to a canonical orientation; and, if the identification document is not in a rectangular shape, the image transformation module 250 may stretch the identification document such that one side of the document is increased in size more than another side to transform the identification document to a rectangular shape. -
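The canonical-form corrections above (scaling, rotation, and repositioning) can be sketched as a transformation of the document's corner points. This is a simplified illustration, not the module's actual implementation; the angle, scale, and corner values are assumptions:

```python
import math

def to_canonical(corners, angle_deg, scale):
    """Scale and rotate corner points, then translate so the minimum x and y
    land at the origin (the canonical lower-left position)."""
    t = math.radians(angle_deg)
    rotated = []
    for x, y in corners:
        sx, sy = x * scale, y * scale           # enlarge to the default size
        rotated.append((sx * math.cos(t) - sy * math.sin(t),
                        sx * math.sin(t) + sy * math.cos(t)))
    min_x = min(x for x, _ in rotated)
    min_y = min(y for _, y in rotated)
    # Move the document so its lower-left corner sits at the image origin.
    return [(round(x - min_x, 6), round(y - min_y, 6)) for x, y in rotated]

# A half-size document detected rotated 90 degrees counterclockwise:
# scale by 2 and rotate by -90 to restore it.
corners = to_canonical([(0, 0), (0, 150), (100, 150), (100, 0)], -90, 2.0)
```

In practice the same corrections would be applied to the image pixels (e.g., via an affine or perspective warp), with the corner math above determining the warp's parameters.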
FIG. 5 illustrates photo orientation corrections, according to one embodiment. In this example, the identification document, or document 500, is a driver's license, or "Driver ID." The document 500 may have different orientations, such as document 500A, document 500B, and document 500C. Document 500A shows an embodiment where the document is oriented 90 degrees to the left from the canonical (or standardized) orientation shown in the embodiment of document 500D. Document 500B shows an embodiment where the document is oriented 180 degrees from document 500D. Document 500C shows an embodiment where the document is oriented 90 degrees to the right of document 500D. In other embodiments, the angle of the orientation that differs from the canonical orientation of document 500D may be any angle between 0 and 360 degrees. - After the orientation of the document 500 is detected using the
image detection module 230 and the text detection module 240, the image transformation module 250 performs a photo orientation correction, as shown in the figure, to transform the orientation of document 500A, document 500B, and document 500C to the orientation of document 500D. -
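The quarter-turn corrections of FIG. 5 can be sketched as follows, assuming the detected orientation is reported in degrees counterclockwise from the canonical orientation; the snapping step is an illustrative addition for noisy angle estimates, not taken from the patent:

```python
def correction_angle(detected_deg):
    """Counterclockwise degrees to rotate a document detected at
    `detected_deg` (counterclockwise from canonical) back to canonical."""
    snapped = (round(detected_deg / 90) % 4) * 90  # nearest of 0/90/180/270
    return (360 - snapped) % 360

# Document 500A (90 degrees left) needs a 270-degree counterclockwise turn,
# equivalently 90 degrees clockwise; document 500B (180) needs 180 more.
```

For arbitrary angles between 0 and 360 degrees, the snapping step would be dropped and the raw detected angle negated instead.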
FIG. 6A illustrates a generic photo orientation correction, according to one embodiment. In this example, the identification document, or document 600A, is a driver's license, or "Driver ID." The document 600A includes an image 610A that depicts a user associated with the document 600A. The document 600A is rotated at an angle 620, θ, of a value between 0 and 360 degrees from a standardized orientation, such as the orientation of document 500D shown in FIG. 5. The height 640, or h′, of the bounding box 650 of the document 600A may be determined using the geometric equation h′=h*cos(θ)+w*sin(θ), where w is the width of the document 600A and h is the height of the document 600A. The width 630, or w′, of the bounding box 650 of the document 600A may be determined using the geometric equation w′=w*cos(θ)+h*sin(θ). The image 610A is also rotated by the same angle 620 from the standardized orientation as the rest of the document 600A. This information may be used by the image detection module to determine the orientation of the document 600A once it has determined the location of document 600A in an image. -
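The bounding-box relations of FIG. 6A can be written as a small function; the formulas below hold for rotations between 0 and 90 degrees, where both sine and cosine are non-negative:

```python
import math

def bounding_box(w, h, theta_deg):
    """Axis-aligned bounding box of a w-by-h rectangle rotated by theta degrees.

    Implements w' = w*cos(theta) + h*sin(theta)
           and h' = h*cos(theta) + w*sin(theta),
    valid for 0 <= theta <= 90 degrees.
    """
    t = math.radians(theta_deg)
    w_prime = w * math.cos(t) + h * math.sin(t)
    h_prime = h * math.cos(t) + w * math.sin(t)
    return w_prime, h_prime

# At 0 degrees the bounding box is the document itself;
# at 90 degrees the width and height swap.
```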
FIG. 6B illustrates transforming the identification document to change the shape of the identification document, according to one embodiment. The system detects that the identification document needs a correction based on the shape of the identification document in the image, since the identification document is trapezoidal in shape with two unequal parallel sides rather than rectangular with equal parallel sides. The system performs the correction by transforming the identification document to stretch the dimensions of the document, thereby generating a rectangular identification document image. In another embodiment, the system detects the boundary of the document 600B and performs the correction based on the dimensions of the boundary. In this example, the identification document, or document 600B, is a driver's license, or "Driver ID." The document 600B includes an image 610B that depicts a user associated with the document 600B. In the embodiment depicted in FIG. 6B, the document is rotated into the image, such that rotating the document out, where out is depicted in the direction of the arrows, would transform the document 600B to the canonical shape (i.e., rectangular shape). The image 610B is also rotated inward by the same amount from the standardized orientation as the rest of the document 600B. This information is used by the image detection module to determine the orientation of the document 600B once it has determined the location of document 600B in an image. -
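One step of such a shape correction can be sketched as computing the target rectangle's size from the four detected corners; a perspective warp would then map the trapezoid's corners onto that rectangle. The corner values and the longest-of-opposite-sides rule here are illustrative assumptions, not the patent's stated method:

```python
import math

def target_size(tl, tr, br, bl):
    """Target rectangle size for a warped document whose detected corners
    are given in order: top-left, top-right, bottom-right, bottom-left."""
    width = max(math.dist(tl, tr), math.dist(bl, br))   # longer of top/bottom
    height = max(math.dist(tl, bl), math.dist(tr, br))  # longer of left/right
    return round(width), round(height)

# A document leaning "into" the image: the top edge appears shorter.
w, h = target_size((20, 0), (80, 0), (100, 60), (0, 60))
```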
FIG. 7 illustrates example images of identification documents, according to one embodiment. The example images are oriented at different example orientations, none of which is exactly the standardized orientation. Though some orientations may appear close to the standardized orientation of document 500D depicted in FIG. 5, the user placement of the identification documents in each image is slightly different, and therefore the images may have to be transformed for the identification documents to be in the standardized orientation. In addition, the background of each image is different, and the system only needs the document itself, not the background, which may be distracting because of the patterns and objects it includes. Therefore, in some embodiments, the image transformation module 250 removes the background of the image to leave only the identification document in the standardized orientation. - Once an identification document is transformed to a canonical form, the identification document is stored in association with the user account. Images of identification documents that have been transformed to canonical form can be compared with higher accuracy. In one embodiment, the system uses image processing techniques for comparing images of faces of people to determine whether the images represent the face of the same person. In another embodiment, the system uses machine learning based techniques, for example, deep learning based techniques, for comparing images of faces of people to determine whether two images show the face of the same user. The system uses image comparison as well as comparison of user account information to determine whether two user accounts belong to the same user. This allows the system to identify duplicate user accounts and take appropriate actions, for example, sending a message to the user to consolidate the user accounts or to disable at least some of the user accounts.
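The face-comparison step can be sketched as follows, with a distance between face-embedding vectors mapped to a similarity score; the embedding values and the distance-to-score mapping are illustrative assumptions:

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_score(emb_a, emb_b):
    """Map a distance into a (0, 1] similarity score; 1.0 means identical.
    A smaller distance between face embeddings indicates greater similarity."""
    return 1.0 / (1.0 + euclidean_distance(emb_a, emb_b))

same_user = similarity_score([0.1, 0.9, 0.3], [0.1, 0.9, 0.3])
different = similarity_score([0.1, 0.9, 0.3], [0.8, 0.1, 0.5])
```

In a real deployment the embeddings would come from a face-recognition model applied to the canonical-form document images, and the resulting score would be stored on the corresponding graph edge.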
Users may create multiple user accounts to bypass certain checks performed by the system based on policies. For example, if a user account is flagged as violating a certain policy enforced by the system, the user may create an alternate account. Similarly, if the system enforces a certain quota per user, a user may create multiple user accounts to game the system, thereby exceeding the quota. Embodiments of the invention detect duplicate user accounts to ensure that each user has a single user account, thereby enforcing the policies strictly.
-
FIG. 8 is a flowchart illustrating the process for identifying duplicate user accounts, according to one embodiment. The system receives 800 a plurality of user accounts and, for each of a plurality of pairs of user accounts, determines 805 a similarity score indicative of similarity between a first user account and a second user account in the pair. The system determines 810 an initial threshold similarity score that is indicative of a particular degree of similarity between user accounts. This initial threshold similarity score is used to determine which user accounts are similar to one another and which user accounts are not. The system repeats the following steps 815 and 820 for a plurality of iterations to determine connected components of user accounts, where connected nodes in a graph represent similar user accounts. Each iteration of the steps has a threshold similarity score, which is initialized to the initial threshold similarity score. - The system determines 815 one or more connected components in the graph of nodes and edges. The nodes represent user accounts, and a pair of nodes has an edge if the similarity score of the pair of nodes indicates a greater degree of similarity than that indicated by the threshold similarity score. In some embodiments, a greater number for a similarity score may indicate a greater degree of similarity. In other embodiments, a smaller number for a similarity score may indicate a greater degree of similarity. The system modifies 820 the threshold similarity score for the next iteration to a value indicative of a higher degree of similarity between user accounts compared with the threshold similarity score for the current iteration. In some embodiments, the system removes edges with a similarity score indicating a degree of similarity less than the modified threshold similarity score.
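Steps 815 and 820 can be sketched as follows, assuming a higher score indicates greater similarity: build the graph at the current threshold, find its connected components by traversal, then raise the threshold and repeat. The node names and scores are illustrative:

```python
def connected_components(nodes, edges, threshold):
    """Components of the graph keeping only edges at or above `threshold`.
    `edges` maps (node_a, node_b) pairs to similarity scores."""
    adjacency = {n: set() for n in nodes}
    for (a, b), score in edges.items():
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adjacency[n] - component)
        seen |= component
        components.append(component)
    return components

nodes = ["u1", "u2", "u3", "u4"]
scores = {("u1", "u2"): 0.95, ("u2", "u3"): 0.7, ("u3", "u4"): 0.9}
loose = connected_components(nodes, scores, 0.6)   # one loosely linked group
strict = connected_components(nodes, scores, 0.8)  # u1-u2 and u3-u4 split apart
```

Raising the threshold from 0.6 to 0.8 drops the weak u2-u3 edge, splitting one large component into two tighter ones, which is the per-iteration behavior described above.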
Accordingly, the initial connected components include user accounts that may not be very similar, but as the iterations proceed, the user accounts in each connected component are more likely to be similar and represent duplicate user accounts. The system repeats the
steps 815 and 820 until the user accounts are within a certain degree of similarity. This may be determined by the size of connected components, the number of connected components, or the value of the similarity score. For example, in some embodiments, the system repeats the steps 815 and 820 until the system can no longer remove edges from the connected components due to a high degree of similarity. - Responsive to repeating the steps for a plurality of iterations, the system identifies 825 one or more connected components, where each connected component represents a set of user accounts for a particular user. The system stops the iterations based on certain criteria. For example, in an embodiment, the system repeats the process for a fixed number of iterations. In another embodiment, the system stops the iterations if there are no changes in the connected components between subsequent iterations or if the changes in the connected components are below a threshold amount between iterations. In some embodiments, the system stops the iterations if the number of connected components exceeds a threshold. In some embodiments, the system determines whether an aggregate measure based on sizes of connected components falls below a threshold, thereby indicating that the connected components are dense (i.e., most, if not all, nodes in the connected component are connected to one another). The system transmits 830 information describing the identified one or more connected components to a privileged user, for example, an analyst, for verification. At the end of the process, the system has connected components of user accounts that represent duplicate accounts with a high likelihood.
- In some embodiments, the system uses a connection ratio threshold to determine whether to alter the threshold similarity score. The connection ratio threshold represents how dense a connected component is (i.e., the number of edges per number of nodes that must exist within a connected component for the system to indicate that the connected component likely contains duplicate user accounts). The connection ratio threshold may be specified by a user, for example, as a system configuration parameter specified by a system administrator. Alternatively, the system may analyze previous results to estimate a connection ratio threshold. For example, the system identifies various connected components determined during previous executions of the process illustrated in
FIG. 8. The system marks the nodes that were determined to represent duplicate user accounts at the end of execution of the process. For each connected component, the system identifies the number of edges in the connected component and determines whether the connected component contains duplicate user accounts. The system determines an aggregate measure of the number of edges of connected components that contain duplicate user accounts. In an embodiment, the connection ratio threshold is determined as an aggregate measure of the ratios of the number of edges of connected components containing duplicate user accounts to the size of each connected component, as indicated by its number of nodes.
- The system saves connected components with more edges, as determined using the connection ratio threshold. These connected components may also be referred to as dense connected components, which contain nodes of likely duplicate user accounts. The system determines whether a connected component is sparse, i.e., has a small number of edges compared to the size of the connected component, by comparing the ratio of the number of edges of the connected component to the number of nodes of the connected component with the connection ratio threshold. If the system determines that this ratio is smaller than the connection ratio threshold, the system modifies the threshold similarity score representing the degree of similarity of connected components to break the connected components into smaller, denser connected components. For example, a connected component with 5 nodes would have 10 edges if fully-connected. A connection ratio threshold of 7 may indicate that the connected component must be connected by at least 70% of the maximum number of edges. 
Accordingly, the system saves dense connected components with 7 or more edges and alters the threshold similarity score to remove some edges to divide a sparse connected component into smaller, denser connected components.
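The density test above can be sketched in a few lines. This is an illustrative Python sketch, not the patent's implementation; it follows one reading of the five-node example, treating the connection ratio threshold of 7 as a minimum edge count for a component of that size, and the helper names are assumptions.

```python
def max_edges(num_nodes):
    # a fully-connected component on n nodes has n*(n-1)/2 edges
    return num_nodes * (num_nodes - 1) // 2

def is_dense(num_edges, connection_ratio_threshold):
    # dense components are saved as likely duplicate sets; sparse ones
    # trigger a stricter similarity threshold to break them apart
    return num_edges >= connection_ratio_threshold

# the example above: 5 nodes allow at most 10 edges, and a
# threshold of 7 requires at least 70% of them
assert max_edges(5) == 10
assert is_dense(7, 7) and not is_dense(6, 7)
```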
- In some embodiments, the system calculates a connectivity ratio for each connected component and compares the connectivity ratio to the connection ratio threshold. The connectivity ratio may be a relationship between the number of edges, k, and nodes, n, in a connected component, as shown in
Equation 1.

r = k / n    (Equation 1)

- In embodiments where the connected component is fully-connected, such that k = n(n − 1)/2, the connectivity ratio may be represented by
Equation 2.

r = (n − 1) / 2    (Equation 2)

- If the system determines that the connectivity ratio is smaller than the connection ratio threshold, the system modifies the threshold similarity score representing the degree of similarity of connected components to break the connected components into smaller, denser connected components.
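The connectivity-ratio check can be sketched as follows. This is an illustrative Python sketch, not part of the patent, and it assumes (consistent with the surrounding prose) that Equation 1 is the edges-to-nodes ratio k/n, so that a fully-connected component with k = n(n − 1)/2 edges yields (n − 1)/2 for Equation 2.

```python
def connectivity_ratio(num_edges, num_nodes):
    # Equation 1 (as described in the prose): r = k / n
    return num_edges / num_nodes

def fully_connected_ratio(num_nodes):
    # Equation 2: a fully-connected component has k = n*(n-1)/2 edges,
    # so its connectivity ratio is (n - 1) / 2
    return (num_nodes - 1) / 2

def should_split(num_edges, num_nodes, connection_ratio_threshold):
    # a ratio below the threshold marks the component as sparse,
    # prompting a stricter similarity threshold to break it apart
    return connectivity_ratio(num_edges, num_nodes) < connection_ratio_threshold
```

For a five-node component, the fully-connected ratio is 2.0; a component with only four of the ten possible edges falls below that and would be split.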
- The system stores the connected component information by associating user accounts belonging to the same connected component. In some embodiments, the system sends messages to users determined to have duplicate user accounts requesting the users to consolidate the user accounts or delete additional user accounts. In some embodiments, the system disables one or more user accounts from a connected component. For example, the system identifies the oldest user account, keeps it active, and disables all the remaining user accounts in the connected component. In an embodiment, the system identifies the user account that is associated with the highest level of activity and disables the remaining user accounts. The system may disable a user account by preventing the user from using the account unless the user provides additional authentication information or calls and talks to a customer service representative to provide authentication information. If a user provides information indicating that a user account in the connected component is not a duplicate of another user account in the connected component, the information is stored in the user account store 160 and used the next time the duplicate detection process is executed. In an embodiment, the user account store 160 maintains a table storing relations between user accounts that have been verified as belonging to distinct users. Each user account may have a user account identifier that uniquely identifies the user account, and the table stores pairs of user account identifiers for user accounts that are verified as belonging to distinct users. Accordingly, an edge between two user accounts is removed (or never created when the graph is initialized) if the two user accounts have been previously verified as being distinct user accounts.
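Consulting the verified-distinct table when building the graph can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the helper names and account identifiers are assumptions.

```python
def canonical(pair):
    # store each account pair in a fixed order so (a, b) and (b, a) match
    a, b = pair
    return (a, b) if a <= b else (b, a)

def filter_verified_edges(edges, verified_distinct_pairs):
    # drop (or never create) edges between account pairs previously
    # verified as belonging to distinct users
    verified = {canonical(p) for p in verified_distinct_pairs}
    return [e for e in edges if canonical(e) not in verified]
```

For example, if accounts a1 and a2 were previously verified as distinct, the a1–a2 edge is filtered out before connected components are computed, regardless of the pair's similarity score.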
- It is appreciated that although
FIG. 8 illustrates a number of interactions according to one embodiment, the precise interactions and/or order of interactions may vary in different embodiments. For example, the steps may only be repeated once for a particular threshold similarity score, according to some embodiments. In other embodiments, the steps may be repeated until a threshold level of connected components of user accounts has been formed or some other threshold condition has been met. -
FIG. 9A illustrates examples of connected components of user accounts, according to one embodiment. A fully-connected connected component 900 has nodes that are connected to all other nodes in the connected component by edges. This type of connected component indicates that all of the user accounts associated with the nodes are within some degree of similarity to one another (i.e., directly connected to one another). Connected components 910, 920, 930 are examples of low-quality connected components that need to be regenerated, since not every node is connected by a degree of similarity. A star shape connected component 910 has a plurality of nodes all connected to one center node. This type of connected component indicates that the plurality of nodes are all within a degree of similarity to the center node but not within the degree of similarity to one another. A chain shape connected component 920 has a plurality of nodes connected in chains. This type of connected component indicates that nodes are connected to one another within a degree of similarity but are not all similar to one another within that degree. In some embodiments, each node in a chain shape connected component 920 is connected to only two other nodes. In other embodiments, some nodes in the connected component may be connected to more than two nodes, but some nodes in the connected component must be connected to at most two nodes. A connected component of sub-components 930 connects nodes in separate connected components into one connected component. This indicates that the connected components are similar in some way. The connected components are connected by inside nodes, which are the nodes that connect the connected components to one another. In some embodiments, there may be more than one pair of inside nodes connecting a connected component. -
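The component shapes of FIG. 9A can be distinguished by their node degrees. This is a rough illustrative classifier in Python, not part of the patent; the function name and the exact degree criteria are assumptions.

```python
from collections import Counter

def component_shape(nodes, edges):
    # count the degree (number of incident edges) of each node
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n, k = len(nodes), len(edges)
    if k == n * (n - 1) // 2:
        return "fully-connected"          # e.g. component 900
    if k == n - 1:                        # tree-shaped components
        degrees = sorted(degree[v] for v in nodes)
        if degrees[-1] == n - 1:
            return "star"                 # e.g. component 910
        if degrees == [1, 1] + [2] * (n - 2):
            return "chain"                # e.g. component 920
    return "other"                        # e.g. sub-components 930
```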
FIG. 9B illustrates an example of forming connected components after modifying the threshold value for the threshold similarity score. In this example, a larger number indicates a greater degree of similarity. The example of FIG. 9B depicts a connected component of sub-components 940A, wherein the edges in the connected component have a similarity score greater than 0.23. When the threshold value is updated to a value above 0.23, edges are removed from the connected component of sub-components 940A. This results in the removal of an inside edge that was connecting two connected components, connected component 940B and connected component 940C, together. Connected components 940B and 940C have nodes with a higher degree of similarity than the nodes in the connected component of sub-components 940A. -
FIG. 10A illustrates an example of a set of images from documents at one threshold value, according to one embodiment. In this example, a larger number indicates a greater degree of similarity. Each image in connected component 1000A represents a node and depicts a user associated with a different user account. The degree of similarity between the images is 0.23214. -
FIG. 10B illustrates an example of subsets of the set of images from FIG. 10A, the subsets having a greater degree of similarity than the set of images from FIG. 10A, according to one embodiment. The degree of similarity between the images in connected component 1000B is 0.5, and the degree of similarity between the images in connected component 1000C is 0.5. The images in connected component 1000B appear to depict the same user, indicating that the user has signed up for five accounts to circumvent the rules of the system or allowed other users to use their identification document, according to some embodiments. The same is true for connected component 1000C, but only two user accounts have been made with that user's image from their identification document. -
FIG. 11 is a high-level block diagram illustrating physical components of a computer used as part or all of the client device from FIG. 1, according to one embodiment. Illustrated are at least one processor 1102 coupled to a chipset 1104. Also coupled to the chipset 1104 are a memory 1106, a storage device 1108, a graphics adapter 1112, and a network adapter 1116. A display 1118 is coupled to the graphics adapter 1112. In one embodiment, the functionality of the chipset 1104 is provided by a memory controller hub 1120 and an I/O controller hub 1122. In another embodiment, the memory 1106 is coupled directly to the processor 1102 instead of the chipset 1104.
- The storage device 1108 is any non-transitory computer-readable storage medium, such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 1106 holds instructions and data used by the processor 1102. The graphics adapter 1112 displays images and other information on the display 1118. The network adapter 1116 couples the computer 1100 to a local or wide area network.
- As is known in the art, a computer 1100 can have different and/or other components than those shown in FIG. 11. In addition, the computer 1100 can lack certain illustrated components. In one embodiment, a computer 1100 acting as a server may lack a graphics adapter 1112 and/or display 1118, as well as a keyboard or pointing device. Moreover, the storage device 1108 can be local and/or remote from the computer 1100 (such as embodied within a storage area network (SAN)).
- As is known in the art, the computer 1100 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic utilized to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 1108, loaded into the memory 1106, and executed by the processor 1102.
- Embodiments of the entities described herein can include other and/or different modules than the ones described here. In addition, the functionality attributed to the modules can be performed by other or different modules in other embodiments. Moreover, this description occasionally omits the term “module” for purposes of clarity and convenience.
- The present invention has been described in particular detail with respect to one possible embodiment. Those of skill in the art will appreciate that the invention may be practiced in other embodiments. First, the particular naming of the components and variables, capitalization of terms, the attributes, data structures, or any other programming or structural aspect is not mandatory or significant, and the mechanisms that implement the invention or its features may have different names, formats, or protocols. Also, the particular division of functionality between the various system components described herein is merely for purposes of example, and is not mandatory; functions performed by a single system component may instead be performed by multiple components, and functions performed by multiple components may instead be performed by a single component.
- Some portions of the above description present the features of the present invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or by functional names, without loss of generality.
- Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
- Certain aspects of the present invention include process steps and instructions described herein in the form of an algorithm. It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
- The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer. Such a computer program may be stored in a non-transitory computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of computer-readable storage medium suitable for storing electronic instructions, each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
- The algorithms and operations presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will be apparent to those of skill in the art, along with equivalent variations. In addition, the present invention is not described with reference to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any references to specific languages are provided for disclosure of enablement and best mode of the present invention.
- The present invention is well suited to a wide variety of computer network systems over numerous topologies. Within this field, the configuration and management of large networks comprise storage devices and computers that are communicatively coupled to dissimilar computers and storage devices over a network, such as the Internet.
- Finally, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/832,726 US20200311844A1 (en) | 2019-03-27 | 2020-03-27 | Identifying duplicate user accounts in an identification document processing system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201962825016P | 2019-03-27 | 2019-03-27 | |
| US16/832,726 US20200311844A1 (en) | 2019-03-27 | 2020-03-27 | Identifying duplicate user accounts in an identification document processing system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20200311844A1 true US20200311844A1 (en) | 2020-10-01 |
Family
ID=72605949
Family Applications (4)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/832,711 Active 2040-12-15 US11449960B2 (en) | 2019-03-27 | 2020-03-27 | Neural network based identification document processing system |
| US16/832,726 Abandoned US20200311844A1 (en) | 2019-03-27 | 2020-03-27 | Identifying duplicate user accounts in an identification document processing system |
| US17/948,068 Active 2040-06-29 US12182895B2 (en) | 2019-03-27 | 2022-09-19 | Neural network based identification document processing system |
| US18/954,366 Pending US20250148562A1 (en) | 2019-03-27 | 2024-11-20 | Neural network based identification document processing system |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/832,711 Active 2040-12-15 US11449960B2 (en) | 2019-03-27 | 2020-03-27 | Neural network based identification document processing system |
Family Applications After (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/948,068 Active 2040-06-29 US12182895B2 (en) | 2019-03-27 | 2022-09-19 | Neural network based identification document processing system |
| US18/954,366 Pending US20250148562A1 (en) | 2019-03-27 | 2024-11-20 | Neural network based identification document processing system |
Country Status (1)
| Country | Link |
|---|---|
| US (4) | US11449960B2 (en) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10616443B1 (en) * | 2019-02-11 | 2020-04-07 | Open Text Sa Ulc | On-device artificial intelligence systems and methods for document auto-rotation |
| FR3095371B1 (en) * | 2019-04-25 | 2021-04-30 | Idemia Identity & Security France | Method for authenticating an individual's identity document and possibly for authenticating said individual |
| US11003937B2 (en) * | 2019-06-26 | 2021-05-11 | Infrrd Inc | System for extracting text from images |
| EP4105825A1 (en) * | 2021-06-14 | 2022-12-21 | Onfido Ltd | Generalised anomaly detection |
| CN113326267B (en) * | 2021-06-24 | 2023-08-08 | 长三角信息智能创新研究院 | Address matching method based on inverted index and neural network algorithm |
| CN114531597A (en) * | 2021-12-29 | 2022-05-24 | 福建正孚软件有限公司 | Image information coding method and storage medium |
| CN117173545B (en) * | 2023-11-03 | 2024-01-30 | 天逸财金科技服务(武汉)有限公司 | A method for identifying original certificates and licenses based on computer graphics |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7899825B2 (en) | 2001-06-27 | 2011-03-01 | SAP America, Inc. | Method and apparatus for duplicate detection |
| US7725421B1 (en) | 2006-07-26 | 2010-05-25 | Google Inc. | Duplicate account identification and scoring |
| US9230103B2 (en) | 2011-10-03 | 2016-01-05 | Zoosk, Inc. | System and method for registering users for communicating information on a web site |
| US10019466B2 (en) | 2016-01-11 | 2018-07-10 | Facebook, Inc. | Identification of low-quality place-entities on online social networks |
| US10902252B2 (en) * | 2017-07-17 | 2021-01-26 | Open Text Corporation | Systems and methods for image based content capture and extraction utilizing deep learning neural network and bounding box detection training techniques |
2020
- 2020-03-27 US US16/832,711 patent/US11449960B2/en active Active
- 2020-03-27 US US16/832,726 patent/US20200311844A1/en not_active Abandoned
2022
- 2022-09-19 US US17/948,068 patent/US12182895B2/en active Active
2024
- 2024-11-20 US US18/954,366 patent/US20250148562A1/en active Pending
Cited By (37)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12182791B1 (en) | 2006-10-31 | 2024-12-31 | United Services Automobile Association (Usaa) | Systems and methods for remote deposit of checks |
| US12002016B1 (en) | 2006-10-31 | 2024-06-04 | United Services Automobile Association (Usaa) | Systems and methods for remote deposit of checks |
| US11875314B1 (en) | 2006-10-31 | 2024-01-16 | United Services Automobile Association (Usaa) | Systems and methods for remote deposit of checks |
| US11704634B1 (en) | 2007-09-28 | 2023-07-18 | United Services Automobile Association (Usaa) | Systems and methods for digital signature detection |
| US12175439B1 (en) | 2007-10-23 | 2024-12-24 | United Services Automobile Association (Usaa) | Image processing |
| US12229737B2 (en) | 2008-02-07 | 2025-02-18 | United Services Automobile Association (Usaa) | Systems and methods for mobile deposit of negotiable instruments |
| US11783306B1 (en) | 2008-02-07 | 2023-10-10 | United Services Automobile Association (Usaa) | Systems and methods for mobile deposit of negotiable instruments |
| US11749007B1 (en) | 2009-02-18 | 2023-09-05 | United Services Automobile Association (Usaa) | Systems and methods of check detection |
| US11721117B1 (en) | 2009-03-04 | 2023-08-08 | United Services Automobile Association (Usaa) | Systems and methods of check processing with background removal |
| US11756009B1 (en) | 2009-08-19 | 2023-09-12 | United Services Automobile Association (Usaa) | Apparatuses, methods and systems for a publishing and subscribing platform of depositing negotiable instruments |
| US12211015B1 (en) | 2009-08-19 | 2025-01-28 | United Services Automobile Association (Usaa) | Apparatuses, methods and systems for a publishing and subscribing platform of depositing negotiable instruments |
| US12008522B1 (en) | 2009-08-19 | 2024-06-11 | United Services Automobile Association (Usaa) | Apparatuses, methods and systems for a publishing and subscribing platform of depositing negotiable instruments |
| US12159310B1 (en) | 2009-08-21 | 2024-12-03 | United Services Automobile Association (Usaa) | System and method for mobile check deposit enabling auto-capture functionality via video frame processing |
| US12131300B1 (en) | 2009-08-28 | 2024-10-29 | United Services Automobile Association (Usaa) | Computer systems for updating a record to reflect data contained in image of document automatically captured on a user's remote mobile phone using a downloaded app with alignment guide |
| US12400257B1 (en) | 2010-06-08 | 2025-08-26 | United Services Automobile Association (Usaa) | Automatic remote deposit image preparation apparatuses, methods and systems |
| US11893628B1 (en) | 2010-06-08 | 2024-02-06 | United Services Automobile Association (Usaa) | Apparatuses, methods and systems for a video remote deposit capture platform |
| US11915310B1 (en) | 2010-06-08 | 2024-02-27 | United Services Automobile Association (Usaa) | Apparatuses, methods and systems for a video remote deposit capture platform |
| US12062088B1 (en) | 2010-06-08 | 2024-08-13 | United Services Automobile Association (Usaa) | Apparatuses, methods, and systems for remote deposit capture with enhanced image detection |
| US11797960B1 (en) | 2012-01-05 | 2023-10-24 | United Services Automobile Association (Usaa) | System and method for storefront bank deposits |
| US12182781B1 (en) | 2013-09-09 | 2024-12-31 | United Services Automobile Association (Usaa) | Systems and methods for remote deposit of currency |
| US12184925B1 (en) | 2015-12-22 | 2024-12-31 | United Services Automobile Association (Usaa) | System and method for capturing audio or video data |
| US12002449B1 (en) | 2016-01-22 | 2024-06-04 | United Services Automobile Association (Usaa) | Voice commands for the visually impaired |
| US11676285B1 (en) | 2018-04-27 | 2023-06-13 | United Services Automobile Association (Usaa) | System, computing device, and method for document detection |
| US11188740B2 (en) * | 2019-12-18 | 2021-11-30 | Qualcomm Incorporated | Two-pass omni-directional object detection |
| US11954847B2 (en) * | 2020-07-02 | 2024-04-09 | Tul Corporation | Image identification method and system |
| US20220005173A1 (en) * | 2020-07-02 | 2022-01-06 | Tul Corporation | Image identification method and system |
| US11482028B2 (en) * | 2020-09-28 | 2022-10-25 | Rakuten Group, Inc. | Verification system, verification method, and information storage medium |
| US20220100993A1 (en) * | 2020-09-28 | 2022-03-31 | Rakuten Group, Inc. | Verification system, verification method, and information storage medium |
| US11900755B1 (en) * | 2020-11-30 | 2024-02-13 | United Services Automobile Association (Usaa) | System, computing device, and method for document detection and deposit processing |
| US12260700B1 (en) | 2020-11-30 | 2025-03-25 | United Services Automobile Association (Usaa) | System, computing device, and method for document detection and deposit processing |
| US12113938B2 (en) * | 2020-12-07 | 2024-10-08 | Canon Kabushiki Kaisha | Image processing system, image processing apparatus, control method |
| US20220182497A1 (en) * | 2020-12-07 | 2022-06-09 | Canon Kabushiki Kaisha | Image processing system, image processing apparatus, control method |
| US20220191235A1 (en) * | 2020-12-11 | 2022-06-16 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for improving security |
| WO2022247955A1 (en) * | 2021-05-28 | 2022-12-01 | 百果园技术(新加坡)有限公司 | Abnormal account identification method, apparatus and device, and storage medium |
| CN115841334A (en) * | 2022-12-19 | 2023-03-24 | 中国平安人寿保险股份有限公司 | Abnormal account identification method and device, electronic equipment and storage medium |
| US20240312173A1 (en) * | 2023-03-13 | 2024-09-19 | Capital One Services, Llc | Automatic orientation correction for captured images |
| US12211095B1 (en) | 2024-03-01 | 2025-01-28 | United Services Automobile Association (Usaa) | System and method for mobile check deposit enabling auto-capture functionality via video frame processing |
Also Published As
| Publication number | Publication date |
|---|---|
| US20200311409A1 (en) | 2020-10-01 |
| US11449960B2 (en) | 2022-09-20 |
| US12182895B2 (en) | 2024-12-31 |
| US20230008443A1 (en) | 2023-01-12 |
| US20250148562A1 (en) | 2025-05-08 |
Similar Documents
| Publication | Title |
|---|---|
| US12182895B2 (en) | Neural network based identification document processing system |
| US10785241B2 (en) | URL attack detection method and apparatus, and electronic device |
| US10839238B2 (en) | Remote user identity validation with threshold-based matching |
| US9485204B2 (en) | Reducing photo-tagging spam |
| US9147127B2 (en) | Verification of user photo IDs |
| US11550996B2 (en) | Method and system for detecting duplicate document using vector quantization |
| CN114730371B (en) | Detecting hostile instances in a biometric-based authentication system using a registered biometric data set |
| WO2022142032A1 (en) | Handwritten signature verification method and apparatus, computer device, and storage medium |
| US11698956B2 (en) | Open data biometric identity validation |
| WO2018072028A1 (en) | Face authentication to mitigate spoofing |
| WO2021042544A1 (en) | Facial verification method and apparatus based on mesh removal model, and computer device and storage medium |
| WO2022078168A1 (en) | Identity verification method and apparatus based on artificial intelligence, and computer device and storage medium |
| Hao et al. | It doesn't look like anything to me: using diffusion model to subvert visual phishing detectors |
| Wang et al. | Beyond boundaries: A comprehensive survey of transferable attacks on AI systems |
| US12393751B2 (en) | System and method for authentication of rareness of a digital asset |
| CN118316699B (en) | Malicious client detection method, device, electronic device and storage medium for encrypted federated learning |
| Wang et al. | Spotting the Fakes: A Deep Dive into GAN-Generated Face Detection |
| Patil et al. | Securing visual integrity: machine learning approaches for forged image detection |
| CN111476668A (en) | Identification method and device of credible relationship, storage medium and computer equipment |
| US11809594B2 (en) | Apparatus and method for securely classifying applications to posts using immutable sequential listings |
| US12217476B1 (en) | Detection of synthetically generated images |
| US11928748B1 (en) | Method and apparatus for scannable non-fungible token generation |
| Farooqui et al. | Automatic detection of fake profiles in online social network using soft computing |
| US20250384189A1 (en) | System and Method for Authentication of Rareness of a Digital Asset |
| US20250217952A1 (en) | Multiple Fraud Type Detection System and Methods |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | STPP | Information on status: patent application and granting procedure in general | Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED |
| | AS | Assignment | Owner name: UBER TECHNOLOGIES, INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LUO, TAO;WU, CHUANG;ZHANG, JINXUE;AND OTHERS;SIGNING DATES FROM 20200403 TO 20210107;REEL/FRAME:055084/0012 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PRE-INTERVIEW COMMUNICATION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |