US20200012963A1 - Curating Training Data For Incremental Re-Training Of A Predictive Model - Google Patents

Curating Training Data For Incremental Re-Training Of A Predictive Model

Info

Publication number
US20200012963A1
Authority
US
United States
Prior art keywords
data
labeled data
training data
instances
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/418,232
Inventor
David Alan Johnston
Shawn Ryan Jeffery
Vasileios Polychronopoulos
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ByteDance Inc
Original Assignee
Groupon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Groupon Inc
Priority to US16/418,232
Publication of US20200012963A1
Assigned to JPMORGAN CHASE BANK, N.A. (security interest). Assignors: GROUPON, INC.; LIVINGSOCIAL, LLC
Assigned to GROUPON, INC. (assignment of assignors' interest). Assignors: JOHNSTON, DAVID ALAN; JEFFERY, SHAWN RYAN
Assigned to GROUPON, INC. (assignment of assignors' interest). Assignor: POLYCHRONOPOULOS, VASILEIOS
Assigned to LIVINGSOCIAL, LLC (F/K/A LIVINGSOCIAL, INC.) and GROUPON, INC. (termination and release of security interest in intellectual property rights). Assignor: JPMORGAN CHASE BANK, N.A.
Assigned to LIVINGSOCIAL, LLC (F/K/A LIVINGSOCIAL, INC.) and GROUPON, INC. (release of security interest). Assignor: JPMORGAN CHASE BANK, N.A.
Assigned to BYTEDANCE INC. (assignment of assignor's interest). Assignors: GROUPON, INC.; LIVINGSOCIAL, INC.
Legal status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/20: Ensemble learning
    • G06N 5/00: Computing arrangements using knowledge-based models
    • G06N 5/04: Inference or reasoning models

Definitions

  • Embodiments of the invention relate, generally, to curating training data used for incremental re-training of a predictive model via supervised learning.
  • Data being continuously sampled from a data stream is an example of dynamic data. Analysis of such dynamic data typically is based on data-driven statistical models that can be generated using machine learning.
  • the statistical distribution of the set of training data instances used to derive a predictive model using supervised learning should be an accurate representation of the distribution of unlabeled data that will be input to the model for processing. Additionally, the composition of a training data set should be structured to provide as much information as possible to the model.
  • dynamic data is inherently inconsistent. Data quality fluctuations may affect the performance of a statistical model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the statistical model may have to be replaced by a different model that more closely fits the changed data.
  • embodiments of the present invention provide herein systems, methods and computer readable media for curating a training data set to ensure that training data being updated continuously from a data reservoir of verified possible training examples remain an accurate, high-quality representation of the distribution of data that are being input to a predictive model for processing.
  • a curated training data set thus will ensure a high probability of success for adaptive incremental re-training of the model to improve model performance.
  • FIG. 1 illustrates an example system configured to implement an adaptive crowd-trained learning framework that includes a curated training data set that is adapted for accurate representation of dynamic data analysis in accordance with some embodiments discussed herein;
  • FIG. 2 is an illustration of an example of the different effects of updating an exemplary training data set for a binary classification task using labeled data samples that have been respectively chosen from either active learning or dynamic data quality assessment in accordance with some embodiments discussed herein;
  • FIG. 3 is a flow diagram of an example method for automatic updating of a training data set based on incremental re-training of a predictive model in accordance with some embodiments discussed herein;
  • FIG. 4 illustrates a schematic block diagram of circuitry that can be included in a computing device, such as a training data manager module, in accordance with some embodiments discussed herein.
  • system components can be communicatively coupled to one or more of each other. Though the components are described as being separate or distinct, two or more of the components may be combined into a single process or routine.
  • the component functional descriptions provided herein including separation of responsibility for distinct functions is by way of example. Other groupings or other divisions of functional responsibilities can be made as necessary or in accordance with design preferences.
  • the terms “data,” “content,” “information” and similar terms may be used interchangeably to refer to data capable of being captured, transmitted, received, displayed and/or stored in accordance with various example embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of the disclosure.
  • Where a computing device is described herein to receive data from another computing device, the data may be received directly from the another computing device or may be received indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
  • Similarly, where a computing device is described herein to send data to another computing device, the data may be sent directly to the another computing device or may be sent indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
  • Data being continuously sampled from a data stream is an example of dynamic data.
  • the data stream may be generated from a data store while, in some alternative embodiments, the data stream may represent data collected from a variety of online sources (e.g., websites, blogs, and social media). Analysis of such dynamic data typically is based on data-driven statistical models that can be generated using machine learning.
  • machine learning is supervised learning, in which a statistical predictive model is derived based on a training data set of examples representing a particular modeling task to be performed by the model.
  • An exemplary particular task may be a binary classification task in which the predictive model (a binary classifier) returns a judgment as to which of two categories an input data instance most likely belongs.
  • the binary classifier is derived based on a set of labeled training data consisting of data instances, each instance being associated with a verified label identifying the category to which the instance belongs.
  • the labels associated with the training data set of examples have been verified by at least one reliable source of truth (an oracle, hereinafter) to ensure their accuracy.
  • an oracle may be a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software.
  • the statistical distribution of the set of training data instances should be an accurate representation of the distribution of unlabeled data that will be input to the model for processing. Additionally, the composition of a training data set should be structured to provide as much information as possible to the model being derived. Obtaining a set of accurately distributed, high-quality training data instances (i.e., examples that are non-obvious or edge cases that could improve model performance through training) for derivation of a model is difficult, time-consuming, and/or expensive. For example, a training data set for a classification task should be balanced to ensure that examples of one category are not more frequent within the training data set than examples of the other categories, but assembling this distribution may be difficult if the frequency of one of the categories in a general data population is relatively rare.
  • Derivation of a de-duplication classifier (i.e., a classifier that detects duplicates) using supervised learning requires a training data set that includes examples of duplicates as well as examples of non-duplicates, and obtaining enough high-quality examples of duplicates from a general population is particularly difficult.
  • Dynamic data is inherently inconsistent.
  • the quality of the data sources may vary, the quality of the data collection methods may vary, and, in the case of data being collected continuously from a data stream, the overall quality and statistical distribution of the data itself may vary over time.
  • Data quality fluctuations may affect the performance of a statistical model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the performance of a statistical model may degrade so that the current model may have to be replaced by a different model that more closely fits the changed data.
  • the systems and methods described herein are therefore configured to curate a training data set to ensure that training data being updated continuously from a data reservoir of verified possible training examples remain an accurate, high-quality representation of the distribution of data that are being input to a predictive model for processing.
  • As the training data are adapted to fluctuations in quality and composition of the dynamic data being processed, incremental re-training of a predictive model using a curated training data set will ensure a high probability of success in improving the model performance.
  • An adaptive crowd-trained learning framework for automatically building and maintaining a predictive statistical model may be used to perform analysis of dynamic data.
  • An exemplary adaptive crowd-trained framework is described in U.S. Provisional Patent Application No. 61/920,251, entitled “Processing Dynamic Data Using An Adaptive Crowd-Trained Learning System,” filed on Dec. 23, 2013, and which is incorporated herein in its entirety.
  • the framework monitors the performance of the model as new input data are processed and leverages active learning and an oracle to generate feedback about the changing data. Based on the feedback, current examples of the input data being processed are selected to be given true labels by the oracle. The resulting verified examples may be stored in a data reservoir, and the training data set used to derive the model may be updated continuously using these stored high-quality examples.
  • the training data set is curated to ensure that incremental re-training of the model using training data that are updated from the data reservoir will ensure a high probability of success in improving the model performance.
  • FIG. 1 illustrates an example system 100 configured to implement an adaptive crowd-trained learning framework that includes a curated training data set which is adapted for accurate representation of dynamic data analysis.
  • In embodiments, system 100 comprises a predictive model 120 (e.g., a classifier) that has been derived using supervised learning based on a set of training data 180, and that is configured to generate a judgment about the input data 105 in response to receiving a feature representation of the input data 105; an input data analysis component 110 for generating a feature representation of the input data 105; a quality assurance component 140 for assessment of the quality of the input data 105 and of the quality of the judgments of the predictive model 120; an active learning component 130 to facilitate the generation and maintenance of optimized training data 120; at least one oracle 150 for providing a true label for input data 105 that has been selected as a possible training example by the active learning component 130 and/or the quality assurance component 140; a labeled data reservoir 160 for storing the labeled input data 105 received from the oracle 150; and a training data manager 170 for curating the set of training data 180 by updating the set of training data 180 using a subset of the labeled input data instances stored in the labeled data reservoir 160.
  • the predictive model 120 may be derived through supervised learning based on an initial training data set 180 which, in some embodiments, has been generated automatically within the adaptive crowd-trained learning framework 100 .
  • one or more high-quality initial training data sets may be generated automatically from a pool of unlabeled data instances.
  • the unlabeled data instances are dynamic data that have been collected from at least one data stream during at least one time window.
  • the collected data instances are multi-dimensional data, where each data instance is assumed to be described by a set of attributes (i.e., features hereinafter).
  • the sampled data instances are sent to an oracle 150 for labeling.
  • new unlabeled data instances 105 are input to the system 100 for processing by the predictive model 120 .
  • each new data instance 105 may include multi-dimensional data collected from one or more online sources describing a particular business (e.g., a restaurant, a spa), and the predictive model 120 may be a classifier that returns a judgment as to which of a set of categories the business belongs.
  • the predictive model 120 generates a judgment (e.g., a predicted label identifying a category if the model is a classifier) in response to receiving a feature representation of an unlabeled input data instance 105 .
  • the feature representation is generated during input data analysis 110 using a distribution-based feature analysis as described, for example, in U.S. patent application Ser. No. 14/038,661 entitled “Dynamic Clustering for Streaming Data,” filed on Sep. 16, 2013, and which is incorporated herein in its entirety.
  • the judgment generated by the predictive model 120 includes a confidence value.
  • the confidence value included with a classification judgment is a score representing the distance in decision space of the judgment from the task decision boundary. Classification judgments that are more certain are associated with higher confidence scores because those judgments are at greater distances in decision space from the task decision boundary.
  • an input data instance 105 and its associated judgment may be selected as a possible training example by an active learning component 130 .
  • Active learning, as described, for example, in Settles, Burr (2009), “Active Learning Literature Survey”, Computer Sciences Technical Report 1648, University of Wisconsin-Madison, is a semi-supervised learning process in which the distribution of training data set instances can be adjusted to optimally represent a machine learning problem. For example, a machine-learning algorithm may achieve greater accuracy with fewer training labels if the training data set instances are chosen to provide maximum information about the problem.
  • input data instances that result in classifier judgments that are closer to the decision boundary are more likely to provide maximum information about the classification task and thus may be recognized by an active learning component 130 as possible training examples.
  • an active learning component 130 may generate an accuracy assessment by calculating an accuracy assessment score combining model prediction accuracy and data quality.
  • an accuracy assessment may include identifying model prediction errors (e.g., a classifier model generates a predicted judgment with a high confidence value, but the judgment assigns the input data to the wrong category).
  • selection of a possible training example may be implemented by a quality assurance component 140 that monitors the quality of the predictive model performance as well as the quality of the input data being processed.
  • Monitoring quality may be based at least in part on comparing a calculated quality score to an accuracy threshold representing the system's desired accuracy.
  • the calculated quality score may include an accuracy assessment calculated by an active learning component 130 .
  • selected possible training examples are sent to an oracle 150 for verification, which includes the assignment of a true label to the input data instance.
  • the resulting labeled input data 155 are stored in a labeled data reservoir 160 .
  • the labeled data reservoir 160 grows continuously, and includes all possible training data samples that have been respectively selected by any one of multiple sources (e.g., the active learning component 130 and the quality assurance component 140 of system 100 , one-off cleaning tasks, and/or external collections of verified labeled data) for different purposes.
  • the training data manager 170 may update the training data 180 by selecting, from a labeled data reservoir 160 , an optimal subset of training data samples to use in a training data 180 update.
  • the criteria used by the training data manager 170 for selecting the optimal subset of training data samples may be based at least in part on the feature analysis used to generate the initial training data set used to derive the model.
  • the feature analysis includes clustering collected unlabeled data instances into homogeneous groups across multiple dimensions using an unsupervised learning approach that is dependent on the distribution of the input data as described, for example, in U.S. patent application Ser. No. 14/038,661.
  • the criteria used by the training data manager 170 may be based on attributes of a single cluster over time to ensure maintenance of model fidelity over time and/or maintenance of an accurate class balance for classification tasks.
  • the training data manager 170 selection criteria may be used to update the feature extraction criteria implemented by the input data analysis component 110 .
  • FIG. 2 is an illustration 200 of an example of the different effects of updating an exemplary training data set for a binary classification task using labeled data samples that have been respectively chosen from either active learning or dynamic data quality assessment.
  • A model (i.e., a binary classifier) assigns a predicted judgment value 210 to a data sample; a data sample assigned a judgment value that is close to either 0 or 1 has been determined with certainty by the classifier to belong to one or the other of two classes.
  • a judgment value of 0.5 represents a situation in which the classification decision was not certain (i.e., the predicted judgment is close to the decision boundary 215 for the classification task).
  • the dashed curve 240 represents the relative frequencies of new training data samples that would be added to a training data set for this binary classification problem by an active learning component. To enhance the performance of the classifier in situations where the decision was uncertain, the active learning component would choose the majority of new training data samples from input data that resulted in decisions near the decision boundary 215 .
  • the solid curve 230 represents the relative frequencies of new training data samples that would be added to the training data set by dynamic quality assessment.
  • dynamic quality assessment may choose the majority of new training data samples based on whether they statistically belong in the data reservoir distribution. It also may select new training data samples that were classified with certainty (i.e., having a judgment value close to either 0 or 1), but erroneously (e.g., samples in which the predicted judgment from the classifier did not match the result returned from the oracle).
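  • For illustration only (not part of the disclosure), the following minimal Python sketch contrasts the two selection routes depicted in FIG. 2: an active-learning rule that favors samples whose judgment value lies near the 0.5 decision boundary, and a dynamic-quality-assessment rule that favors samples that look atypical for the reservoir's judgment distribution or that were judged confidently but contradicted by the oracle. The margin, the z-score test, and the cutoffs are assumptions.

```python
# Illustrative sketch only; the margin, z-score test, and cutoffs are assumptions.
import statistics

def active_learning_pick(samples, margin=0.15):
    """Select samples whose judgment value lies within `margin` of the 0.5 boundary."""
    return [s for s in samples if abs(s["judgment"] - 0.5) <= margin]

def quality_assessment_pick(samples, reservoir_judgments, z_cutoff=2.0):
    """Select distribution outliers and confident-but-wrong samples."""
    mean = statistics.mean(reservoir_judgments)
    stdev = statistics.pstdev(reservoir_judgments) or 1e-9
    picked = []
    for s in samples:
        is_outlier = abs(s["judgment"] - mean) / stdev > z_cutoff
        confident_but_wrong = (abs(s["judgment"] - 0.5) > 0.4
                               and round(s["judgment"]) != s["oracle_label"])
        if is_outlier or confident_but_wrong:
            picked.append(s)
    return picked

samples = [{"judgment": 0.52, "oracle_label": 1},   # near the boundary
           {"judgment": 0.97, "oracle_label": 0}]   # confident but wrong
print(active_learning_pick(samples))                           # selects the first sample
print(quality_assessment_pick(samples, [0.1, 0.2, 0.8, 0.9]))  # selects the second sample
```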
  • FIG. 3 is a flow diagram of an example method 300 for automatic updating of a training data set based on incremental re-training of a predictive model.
  • the method 300 will be described with respect to a system that includes one or more computing devices and performs the method 300 . Specifically, the method will be described with respect to implementation by training data manager 170 within system 100 .
  • the system selects 310 a set of labeled data instances from a labeled data reservoir.
  • the data in the labeled data reservoir are not included in the training data.
  • a labeled data reservoir includes a pool of possible training data that have been collected continuously over time from input data being processed by the model. Each of the data instances in the reservoir has been assigned a true label (e.g., a verified category identifier for a classification task) by a trusted source (i.e., an oracle).
  • the labeled reservoir data have been collected as a result of having been selected, for different purposes, by one of multiple sources (e.g., the active learning component 130 and the quality assurance component 140 of system 100 ).
  • active learning may select possible training data instances from input data in which the predicted judgment is close to the decision boundary (thus providing maximum information about the task to the model), while dynamic quality assessment may select possible training data instances from input data based on a statistical decision.
  • selecting the set of labeled data instances from the labeled data reservoir is based on a determination that re-training the model with updated training data likely will result in improved model performance. In some embodiments, this determination is based at least in part on analyzing the distribution and quality of the training data. For example, in some embodiments in which the predictive model is a classifier, the selection may be based at least in part on maintenance and/or improvement of class balance in the training data (e.g., adding training examples of rare categories). In a second example, the selection may be based at least in part on adding examples having higher data quality than the training data. In a third example, the selection may be based at least in part on adding examples that have higher accuracy assessment scores, as previously described.
  • In some embodiments, this determination is based on feedback signals received from one or more components of the system (e.g., the active learning component 130 and the quality assurance component 140) and/or on data freshness (i.e., favoring newer data over older data when updating the training data set).
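  • As a hedged illustration of such a determination (the disclosure does not fix a particular rule), the sketch below combines three of the signals named above: class balance, data quality, and data freshness. The simple OR combination and the 30-day freshness window are assumptions.

```python
# Illustrative re-training decision; the signal combination and thresholds are assumptions.
from collections import Counter

def should_retrain(training_labels, candidate_labels,
                   candidate_quality, training_quality,
                   freshness_days, max_age_days=30):
    before = Counter(training_labels)
    after = before + Counter(candidate_labels)
    improves_balance = (min(after.values()) / max(after.values())
                        > min(before.values()) / max(before.values()))
    improves_quality = candidate_quality > training_quality
    is_fresh = freshness_days <= max_age_days
    return improves_balance or improves_quality or is_fresh

# Adding 20 rare-class examples improves class balance, so re-training is recommended.
print(should_retrain(training_labels=[0] * 90 + [1] * 10,
                     candidate_labels=[1] * 20,
                     candidate_quality=0.92, training_quality=0.85,
                     freshness_days=3))   # True
```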
  • the system generates 315 at least one candidate training data set by updating the training data using the set of labeled data instances.
  • updating the training data may include pruning the training data set and replacing removed data with at least a subset of the selected labeled data.
  • pruning the training data set may include removing outliers from the training data.
  • removing outliers may be implemented on a per class basis (e.g., removing a training data sample describing a patient who has been classified as having a particular disease but has attributes that are inconsistent with the attributes describing other patients who have been classified as having that disease).
  • updating the training data may include pruning outliers from the selected labeled data before updating the training data.
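  • One possible realization of such pruning, shown here per class on a single numeric feature, is a z-score test; the test and its cutoff are illustrative assumptions rather than the specific method disclosed.

```python
# Illustrative per-class outlier pruning; the z-score rule and cutoff are assumptions.
import statistics

def prune_outliers_per_class(instances, z_cutoff=2.5):
    """instances: dicts with a class 'label' and a numeric 'feature' value."""
    by_class = {}
    for inst in instances:
        by_class.setdefault(inst["label"], []).append(inst)
    kept = []
    for members in by_class.values():
        values = [m["feature"] for m in members]
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values) or 1e-9
        kept.extend(m for m in members
                    if abs(m["feature"] - mean) / stdev <= z_cutoff)
    return kept

data = [{"label": "disease_x", "feature": v}
        for v in (4.8, 5.1, 5.0, 4.9, 5.2, 4.7, 5.0, 5.1, 4.9, 5.3, 12.0)]
print(len(prune_outliers_per_class(data)))   # 10: the 12.0 instance is pruned
```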
  • the system derives multiple candidate models by deriving each model using differently updated sets of the received training data.
  • each of the differently updated training data sets may represent the current training data having been updated using a different subset of the selected set of labeled data instances.
  • updating the current training data may be based on a greedy algorithm in which new batches of training data instances are added incrementally to the training data set. Before each batch is added, a test is performed to determine if updating the training data by adding the batch will improve the model performance. Additionally and/or alternatively, in some embodiments, updating the current training data may be based on a non-greedy algorithm in which, for example, all the current training data are removed and replaced with a completely new set of training data.
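  • A minimal sketch of the greedy variant described above (the non-greedy variant would simply replace the whole training set at once): batches are added one at a time and kept only if re-training on the enlarged set improves a score. The `train_fn` and `score_fn` callables are caller-supplied stand-ins assumed for illustration.

```python
# Greedy training-data update; `train_fn` and `score_fn` are illustrative stand-ins.
def greedy_update(training_data, batches, train_fn, score_fn):
    current = list(training_data)
    best_score = score_fn(train_fn(current))
    for batch in batches:
        candidate = current + list(batch)
        candidate_score = score_fn(train_fn(candidate))
        if candidate_score > best_score:   # keep the batch only if it helps
            current, best_score = candidate, candidate_score
    return current, best_score

# Toy usage: "training" averages the labels and "scoring" rewards class balance.
train_fn = lambda data: sum(d["label"] for d in data) / len(data)
score_fn = lambda mean_label: 1.0 - abs(mean_label - 0.5)
data = [{"label": 0}] * 8 + [{"label": 1}] * 2
batches = [[{"label": 1}] * 3, [{"label": 0}] * 5]
updated, score = greedy_update(data, batches, train_fn, score_fn)
print(len(updated), round(score, 3))   # 13 0.885 -- only the first batch is kept
```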
  • the system derives 320 a candidate model using supervised learning. In embodiments, the system generates 325 an assessment of whether the candidate model performance is improved from the current model performance. In some embodiments, generating the assessment includes A/B testing in which the same set of data is input to the current model and to at least one candidate model that has been trained using candidate training data and then comparing the performance of the candidate model to the performance of the current model. In some embodiments, comparing the performance of the current model and a candidate model is implemented by cross-validation.
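  • As one way to realize the comparison (an illustration, not the claimed procedure), the sketch below scores a current and a candidate model on the same data by 5-fold cross-validation with scikit-learn; the synthetic data, model choices, and fold count are placeholders.

```python
# Illustrative current-vs-candidate comparison via cross-validation (placeholders throughout).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

current_model = LogisticRegression(C=1.0, max_iter=1000)     # stands in for the current model
candidate_model = LogisticRegression(C=0.1, max_iter=1000)   # stands in for the re-trained candidate

current_score = cross_val_score(current_model, X, y, cv=5).mean()
candidate_score = cross_val_score(candidate_model, X, y, cv=5).mean()

print(f"current={current_score:.3f} candidate={candidate_score:.3f}")
print("instantiate candidate" if candidate_score > current_score else "keep current model")
```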
  • In embodiments in which the model performs real-time analysis of input data from a data stream (e.g., embodiments of dynamic data analysis system 100), the input data stream may be forked to multiple models so that A/B testing is implemented in parallel for all the models.
  • the system updates 320 the training data and instantiates a re-trained model in an instance in which the assessment indicates that re-training the current model using the updated training data results in improved model performance.
  • the system instantiates 335 the candidate updated training data used to derive the candidate model and the candidate model before the process ends 340 .
  • instantiating the updated training data includes removing the selected set of labeled data instances used to update the received training data from the labeled data reservoir.
  • the process ends 340 in an instance in which the assessment indicates that a candidate model performance is not improved from the current model performance 330 .
  • FIG. 4 shows a schematic block diagram of circuitry 400 , some or all of which may be included in, for example, adaptive crowd-trained learning framework system 100 .
  • circuitry 400 can include various means, such as processor 402 , memory 404 , communications module 406 , and/or input/output module 408 .
  • As used herein, the term “module” includes hardware, software and/or firmware configured to perform one or more particular functions.
  • circuitry 400 as described herein may be embodied as, for example, circuitry, hardware elements (e.g., a suitably programmed processor, combinational logic circuit, and/or the like), a computer program product comprising computer-readable program instructions stored on a non-transitory computer-readable medium (e.g., memory 404 ) that is executable by a suitably configured processing device (e.g., processor 402 ), or some combination thereof.
  • Processor 402 may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), or some combination thereof. Accordingly, although illustrated in FIG. 4 as a single processor, in some embodiments processor 402 comprises a plurality of processors. The plurality of processors may be embodied on a single computing device or may be distributed across a plurality of computing devices collectively configured to function as circuitry 400 .
  • the plurality of processors may be in operative communication with each other and may be collectively configured to perform one or more functionalities of circuitry 400 as described herein.
  • processor 402 is configured to execute instructions stored in memory 404 or otherwise accessible to processor 402 . These instructions, when executed by processor 402 , may cause circuitry 400 to perform one or more of the functionalities of circuitry 400 as described herein.
  • processor 402 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly.
  • processor 402 when processor 402 is embodied as an ASIC, FPGA or the like, processor 402 may comprise specifically configured hardware for conducting one or more operations described herein.
  • processor 402 when processor 402 is embodied as an executor of instructions, such as may be stored in memory 404 , the instructions may specifically configure processor 402 to perform one or more algorithms and operations described herein, such as those discussed in connection with FIG. 3 .
  • Memory 404 may comprise, for example, volatile memory, non-volatile memory, or some combination thereof. Although illustrated in FIG. 4 as a single memory, memory 404 may comprise a plurality of memory components. The plurality of memory components may be embodied on a single computing device or distributed across a plurality of computing devices. In various embodiments, memory 404 may comprise, for example, a hard disk, random access memory, cache memory, flash memory, a compact disc read only memory (CD-ROM), digital versatile disc read only memory (DVD-ROM), an optical disc, circuitry configured to store information, or some combination thereof. Memory 404 may be configured to store information, data (including analytics data), applications, instructions, or the like for enabling circuitry 400 to carry out various functions in accordance with example embodiments of the present invention.
  • memory 404 is configured to buffer input data for processing by processor 402 . Additionally or alternatively, in at least some embodiments, memory 404 is configured to store program instructions for execution by processor 402 . Memory 404 may store information in the form of static and/or dynamic information. This stored information may be stored and/or used by circuitry 400 during the course of performing its functionalities.
  • Communications module 406 may be embodied as any device or means embodied in circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (e.g., memory 404 ) and executed by a processing device (e.g., processor 402 ), or a combination thereof that is configured to receive and/or transmit data from/to another device, such as, for example, a second circuitry 400 and/or the like.
  • communications module 406 (like other components discussed herein) can be at least partially embodied as or otherwise controlled by processor 402 .
  • communications module 406 may be in communication with processor 402 , such as via a bus.
  • Communications module 406 may include, for example, an antenna, a transmitter, a receiver, a transceiver, network interface card and/or supporting hardware and/or firmware/software for enabling communications with another computing device. Communications module 406 may be configured to receive and/or transmit any data that may be stored by memory 404 using any protocol that may be used for communications between computing devices. Communications module 406 may additionally or alternatively be in communication with the memory 404 , input/output module 408 and/or any other component of circuitry 400 , such as via a bus.
  • Input/output module 408 may be in communication with processor 402 to receive an indication of a user input and/or to provide an audible, visual, mechanical, or other output to a user. Some example visual outputs that may be provided to a user by circuitry 400 are discussed in connection with FIG. 1 .
  • input/output module 408 may include support, for example, for a keyboard, a mouse, a joystick, a display, a touch screen display, a microphone, a speaker, a RFID reader, barcode reader, biometric scanner, and/or other input/output mechanisms.
  • In embodiments where circuitry 400 is embodied as a server or database, aspects of input/output module 408 may be reduced as compared to embodiments where circuitry 400 is implemented as an end-user machine or other type of device designed for complex user interactions. In some embodiments (like other components discussed herein), input/output module 408 may even be eliminated from circuitry 400.
  • Alternatively, such as in embodiments where circuitry 400 is embodied as a server or database, at least some aspects of input/output module 408 may be embodied on an apparatus used by a user that is in communication with circuitry 400.
  • Input/output module 408 may be in communication with the memory 404 , communications module 406 , and/or any other component(s), such as via a bus. Although more than one input/output module and/or other component can be included in circuitry 400 , only one is shown in FIG. 4 to avoid overcomplicating the drawing (like the other components discussed herein).
  • Training data manager module 410 may also or instead be included and configured to perform the functionality discussed herein related to the training data curation discussed above. In some embodiments, some or all of the functionality of training data curation may be performed by processor 402 . In this regard, the example processes and algorithms discussed herein can be performed by at least one processor 402 and/or training data manager module 410 .
  • non-transitory computer readable media can be configured to store firmware, one or more application programs, and/or other software, which include instructions and other computer-readable program code portions that can be executed to control each processor (e.g., processor 402 and/or training data manager module 410 ) of the components of system 100 to implement various operations, including the examples shown above.
  • a series of computer-readable program code portions are embodied in one or more computer program products and can be used, with a computing device, server, and/or other programmable apparatus, to produce machine-implemented processes.
  • Any such computer program instructions and/or other type of code may be loaded onto a computer, processor or other programmable apparatus's circuitry to produce a machine, such that the computer, processor, or other programmable circuitry that executes the code on the machine creates the means for implementing various functions, including those described herein.
  • all or some of the information presented by the example displays discussed herein can be based on data that is received, generated and/or maintained by one or more components of system 100 .
  • one or more external systems such as a remote cloud computing and/or data storage system may also be leveraged to provide at least some of the functionality discussed herein.
  • aspects of embodiments of the present invention may be configured as methods, mobile devices, backend network devices, and the like.
  • embodiments may comprise various means including entirely of hardware or any combination of software and hardware.
  • embodiments may take the form of a computer program product on at least one non-transitory computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium.
  • Any suitable computer-readable storage medium may be utilized including non-transitory hard disks, CD-ROMs, flash memory, optical storage devices, or magnetic storage devices.
  • These computer program instructions may also be stored in a computer-readable storage device (e.g., memory 404 ) that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage device produce an article of manufacture including computer-readable instructions for implementing the function discussed herein.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions discussed herein.
  • Blocks of the block diagrams and flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the circuit diagrams and process flowcharts, and combinations of blocks in the circuit diagrams and process flowcharts, can be implemented by special purpose hardware-based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In general, embodiments of the present invention provide systems, methods and computer readable media for curating a training data set to ensure that training data being updated continuously from a data reservoir of verified possible training examples remain an accurate, high-quality representation of the distribution of data that are being input to a predictive model for processing.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. application Ser. No. 14/918,362, titled “CURATING TRAINING DATA FOR INCREMENTAL RE-TRAINING OF A PREDICTIVE MODEL,” filed Oct. 20, 2015, which claims the benefit of priority to U.S. Provisional Application No. 62/069,692, titled “CURATING TRAINING DATA FOR INCREMENTAL RE-TRAINING OF A PREDICTIVE MODEL,” and filed Oct. 28, 2014, the contents of which are hereby incorporated herein by reference in their entirety.
  • FIELD
  • Embodiments of the invention relate, generally, to curating training data used for incremental re-training of a predictive model via supervised learning.
  • BACKGROUND
  • Current methods for increasing the likelihood of successful incremental re-training of a predictive model exhibit a plurality of problems that make current systems insufficient, ineffective and/or the like. Through applied effort, ingenuity, and innovation, solutions to improve such methods have been realized and are described in connection with embodiments of the present invention.
  • SUMMARY
  • Data being continuously sampled from a data stream is an example of dynamic data. Analysis of such dynamic data typically is based on data-driven statistical models that can be generated using machine learning. The statistical distribution of the set of training data instances used to derive a predictive model using supervised learning should be an accurate representation of the distribution of unlabeled data that will be input to the model for processing. Additionally, the composition of a training data set should be structured to provide as much information as possible to the model. However, dynamic data is inherently inconsistent. Data quality fluctuations may affect the performance of a statistical model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the statistical model may have to be replaced by a different model that more closely fits the changed data.
  • Obtaining a set of accurately distributed, high-quality training data instances for derivation of a model is difficult, time-consuming, and/or expensive. Once a model has been derived from an initial training data set, being able to perform real time monitoring of the performance of the model as well as to perform data quality assessments on dynamic data as it is being collected can enable the model to be adapted incrementally to fluctuations of quality and/or statistical distribution of dynamic data, thus reducing the cost involved in repeatedly replacing the model.
  • In general, embodiments of the present invention provide herein systems, methods and computer readable media for curating a training data set to ensure that training data being updated continuously from a data reservoir of verified possible training examples remain an accurate, high-quality representation of the distribution of data that are being input to a predictive model for processing. A curated training data set thus will ensure a high probability of success for adaptive incremental re-training of the model to improve model performance.
  • The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
  • BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
  • Having thus described the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
  • FIG. 1 illustrates an example system configured to implement an adaptive crowd-trained learning framework that includes a curated training data set that is adapted for accurate representation of dynamic data analysis in accordance with some embodiments discussed herein;
  • FIG. 2 is an illustration of an example of the different effects of updating an exemplary training data set for a binary classification task using labeled data samples that have been respectively chosen from either active learning or dynamic data quality assessment in accordance with some embodiments discussed herein;
  • FIG. 3 is a flow diagram of an example method for automatic updating of a training data set based on incremental re-training of a predictive model in accordance with some embodiments discussed herein; and
  • FIG. 4 illustrates a schematic block diagram of circuitry that can be included in a computing device, such as a training data manager module, in accordance with some embodiments discussed herein.
  • DETAILED DESCRIPTION
  • The present invention now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the invention are shown. Indeed, this invention may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout.
  • As described herein, system components can be communicatively coupled to one or more of each other. Though the components are described as being separate or distinct, two or more of the components may be combined into a single process or routine. The component functional descriptions provided herein including separation of responsibility for distinct functions is by way of example. Other groupings or other divisions of functional responsibilities can be made as necessary or in accordance with design preferences.
  • As used herein, the terms “data,” “content,” “information” and similar terms may be used interchangeably to refer to data capable of being captured, transmitted, received, displayed and/or stored in accordance with various example embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of the disclosure. Further, where a computing device is described herein to receive data from another computing device, the data may be received directly from the another computing device or may be received indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like. Similarly, where a computing device is described herein to send data to another computing device, the data may be sent directly to the another computing device or may be sent indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
  • Data being continuously sampled from a data stream is an example of dynamic data. In some embodiments, the data stream may be generated from a data store while, in some alternative embodiments, the data stream may represent data collected from a variety of online sources (e.g., websites, blogs, and social media). Analysis of such dynamic data typically is based on data-driven statistical models that can be generated using machine learning. One type of machine learning is supervised learning, in which a statistical predictive model is derived based on a training data set of examples representing a particular modeling task to be performed by the model. An exemplary particular task may be a binary classification task in which the predictive model (a binary classifier) returns a judgment as to which of two categories an input data instance most likely belongs. Using supervised learning, the binary classifier is derived based on a set of labeled training data consisting of data instances, each instance being associated with a verified label identifying the category to which the instance belongs. Typically, the labels associated with the training data set of examples have been verified by at least one reliable source of truth (an oracle, hereinafter) to ensure their accuracy. For example, in embodiments, an oracle may be a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software.
  • The statistical distribution of the set of training data instances should be an accurate representation of the distribution of unlabeled data that will be input to the model for processing. Additionally, the composition of a training data set should be structured to provide as much information as possible to the model being derived. Obtaining a set of accurately distributed, high-quality training data instances (i.e., examples that are non-obvious or edge cases that could improve model performance through training) for derivation of a model is difficult, time-consuming, and/or expensive. For example, a training data set for a classification task should be balanced to ensure that examples of one category are not more frequent within the training data set than examples of the other categories, but assembling this distribution may be difficult if the frequency of one of the categories in a general data population is relatively rare. In a second example, derivation of a de-duplication classifier (i.e., a classifier that detects duplicates) using supervised learning requires a training data set that includes examples of duplicates as well as examples of non-duplicates, and obtaining enough high-quality examples of duplicates from a general population is particularly difficult.
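  • For illustration only, one conventional way to obtain the balance discussed above when one category (e.g., duplicate records) is rare is to keep every rare example and down-sample the common class to match; the strategy and names below are assumptions, not part of the disclosure.

```python
# Illustrative down-sampling to balance a binary training set with a rare category.
import random

def balance_binary(examples, rare_label, seed=0):
    rare = [e for e in examples if e["label"] == rare_label]
    common = [e for e in examples if e["label"] != rare_label]
    random.Random(seed).shuffle(common)
    return rare + common[:len(rare)]        # as many common examples as rare ones

# 200 records of which only 10 are duplicates (label 1).
population = [{"id": i, "label": 1 if i % 20 == 0 else 0} for i in range(200)]
balanced = balance_binary(population, rare_label=1)
print(len(balanced), sum(e["label"] for e in balanced))   # 20 examples, 10 of them duplicates
```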
  • Dynamic data is inherently inconsistent. The quality of the data sources may vary, the quality of the data collection methods may vary, and, in the case of data being collected continuously from a data stream, the overall quality and statistical distribution of the data itself may vary over time. Data quality fluctuations may affect the performance of a statistical model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the performance of a statistical model may degrade so that the current model may have to be replaced by a different model that more closely fits the changed data. Once a model has been derived from an initial training data set, being able to perform real time monitoring of the performance of the model as well as to perform data quality assessments on dynamic data as it is being collected can enable an instantiated model to be adapted incrementally to fluctuations of quality and/or statistical distribution of dynamic data, thus reducing the cost involved in repeatedly replacing the model.
  • As such, and according to some example embodiments, the systems and methods described herein are therefore configured to curate a training data set to ensure that training data being updated continuously from a data reservoir of verified possible training examples remain an accurate, high-quality representation of the distribution of data that are being input to a predictive model for processing. Thus, as the training data are being adapted to fluctuations in quality and composition of the dynamic data being processed, the incremental re-training of a predictive model using a curated training data set will ensure a high probability of success in improving the model performance.
  • An adaptive crowd-trained learning framework for automatically building and maintaining a predictive statistical model may be used to perform analysis of dynamic data. An exemplary adaptive crowd-trained framework is described in U.S. Provisional Patent Application No. 61/920,251, entitled “Processing Dynamic Data Using An Adaptive Crowd-Trained Learning System,” filed on Dec. 23, 2013, and which is incorporated herein in its entirety.
  • Once a predictive model is trained using an initial training data set, the framework monitors the performance of the model as new input data are processed and leverages active learning and an oracle to generate feedback about the changing data. Based on the feedback, current examples of the input data being processed are selected to be given true labels by the oracle. The resulting verified examples may be stored in a data reservoir, and the training data set used to derive the model may be updated continuously using these stored high-quality examples. In embodiments, the training data set is curated to ensure that incremental re-training of the model using training data that are updated from the data reservoir will ensure a high probability of success in improving the model performance.
  • For clarity, the inventions will be described for embodiments in which curating training data for incremental re-training of a predictive model is included within an adaptive crowd-trained learning framework. However, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims.
  • FIG. 1 illustrates an example system 100 configured to implement an adaptive crowd-trained learning framework that includes a curated training data set which is adapted for accurate representation of dynamic data analysis. In embodiments, system 100 comprises a predictive model 120 (e.g., a classifier) that has been derived using supervised learning based on a set of training data 180, and that is configured to generate a judgment about the input data 105 in response to receiving a feature representation of the input data 105; an input data analysis component 110 for generating a feature representation of the input data 105; a quality assurance component 140 for assessment of the quality of the input data 105 and of the quality of the judgments of the predictive model 120; an active learning component 130 to facilitate the generation and maintenance of optimized training data 120; at least one oracle 150 for providing a true label for input data 105 that has been selected as a possible training example by the active learning component 130 and/or the quality assurance component 140; a labeled data reservoir 160 for storing the labeled input data 105 received from the oracle 150; and a training data manager 170 for curating the set of training data 180 by updating the set of training data 180 using a subset of the labeled input data instances stored in the labeled data reservoir 160.
  • In embodiments, the predictive model 120 may be derived through supervised learning based on an initial training data set 180 which, in some embodiments, has been generated automatically within the adaptive crowd-trained learning framework 100. In some embodiments, one or more high-quality initial training data sets may be generated automatically from a pool of unlabeled data instances. In some embodiments, the unlabeled data instances are dynamic data that have been collected from at least one data stream during at least one time window. In some embodiments, the collected data instances are multi-dimensional data, where each data instance is assumed to be described by a set of attributes (i.e., features hereinafter). In some embodiments, the sampled data instances are sent to an oracle 150 for labeling.
  • In embodiments, new unlabeled data instances 105, sharing the particular type of the examples in the training data set 180, are input to the system 100 for processing by the predictive model 120. For example, in some embodiments, each new data instance 105 may include multi-dimensional data collected from one or more online sources describing a particular business (e.g., a restaurant, a spa), and the predictive model 120 may be a classifier that returns a judgment as to which of a set of categories the business belongs.
  • In embodiments, the predictive model 120 generates a judgment (e.g., a predicted label identifying a category if the model is a classifier) in response to receiving a feature representation of an unlabeled input data instance 105. In some embodiments, the feature representation is generated during input data analysis 110 using a distribution-based feature analysis as described, for example, in U.S. patent application Ser. No. 14/038,661 entitled “Dynamic Clustering for Streaming Data,” filed on Sep. 16, 2013, and which is incorporated herein in its entirety.
  • In some embodiments, the judgment generated by the predictive model 120 includes a confidence value. For example, in some embodiments in which the predictive model 120 is performing a classification task, the confidence value included with a classification judgment is a score representing the distance in decision space of the judgment from the task decision boundary. Classification judgments that are more certain are associated with higher confidence scores because those judgments are at greater distances in decision space from the task decision boundary.
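  • The disclosure does not prescribe a formula for this score; purely as a sketch, one common realization for a probabilistic binary classifier rescales the distance of the predicted probability from the 0.5 decision boundary into [0, 1].

```python
# Illustrative confidence score: distance from the binary decision boundary, rescaled to [0, 1].
def confidence_score(p_positive: float, boundary: float = 0.5) -> float:
    return abs(p_positive - boundary) / max(boundary, 1.0 - boundary)

for p in (0.02, 0.48, 0.51, 0.97):
    # judgments near 0 or 1 score high; judgments near 0.5 score low
    print(f"p={p:.2f}  confidence={confidence_score(p):.2f}")
```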
  • In some embodiments, an input data instance 105 and its associated judgment may be selected as a possible training example by an active learning component 130. Active learning, as described, for example, in Settles, Burr (2009), “Active Learning Literature Survey”, Computer Sciences Technical Report 1648, University of Wisconsin-Madison, is a semi-supervised learning process in which the distribution of training data set instances can be adjusted to optimally represent a machine learning problem. For example, a machine-learning algorithm may achieve greater accuracy with fewer training labels if the training data set instances are chosen to provide maximum information about the problem. Referring to a classification task example, input data instances that result in classifier judgments that are closer to the decision boundary (e.g., judgments that are associated with lower confidence values as previously described) are more likely to provide maximum information about the classification task and thus may be recognized by an active learning component 130 as possible training examples.
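  • A minimal sketch of such a selection rule (the batch size and data layout are illustrative assumptions): rank processed instances by the distance of their judgment from the decision boundary and queue the closest ones for the oracle.

```python
# Illustrative uncertainty-based selection of instances to send to the oracle.
def select_for_oracle(instances, k=2):
    """instances: iterable of (instance_id, judgment) with judgment in [0, 1]."""
    ranked = sorted(instances, key=lambda item: abs(item[1] - 0.5))
    return ranked[:k]   # the k judgments closest to the 0.5 boundary

judgments = [("a", 0.91), ("b", 0.55), ("c", 0.08), ("d", 0.47), ("e", 0.62)]
print(select_for_oracle(judgments))   # [('d', 0.47), ('b', 0.55)]
```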
  • In some embodiments, an active learning component 130 may generate an accuracy assessment by calculating an accuracy assessment score combining model prediction accuracy and data quality. In some embodiments, an accuracy assessment may include identifying model prediction errors (e.g., a classifier model generates a predicted judgment with a high confidence value, but the judgment assigns the input data to the wrong category).
  • In some embodiments, selection of a possible training example may be implemented by a quality assurance component 140 that monitors the quality of the predictive model performance as well as the quality of the input data being processed. In some embodiments, monitoring quality may be based at least in part on comparing a calculated quality score to an accuracy threshold representing the system's desired accuracy. In some embodiments, the calculated quality score may include an accuracy assessment calculated by an active learning component 130.
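  • A minimal sketch of such a quality gate follows, assuming an equal weighting of model prediction accuracy and data quality; the weighting and threshold values are illustrative only and are not prescribed by the disclosure.

    def quality_gate(prediction_accuracy: float, data_quality: float,
                     accuracy_threshold: float = 0.9, accuracy_weight: float = 0.5):
        """Combine model prediction accuracy and input data quality into one quality
        score and flag the instance for review when the score falls below the
        system's desired accuracy."""
        quality_score = (accuracy_weight * prediction_accuracy
                         + (1.0 - accuracy_weight) * data_quality)
        needs_review = quality_score < accuracy_threshold
        return quality_score, needs_review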
  • In embodiments, selected possible training examples are sent to an oracle 150 for verification, which includes the assignment of a true label to the input data instance. The resulting labeled input data 155 are stored in a labeled data reservoir 160. In some embodiments, the labeled data reservoir 160 grows continuously, and includes all possible training data samples that have been respectively selected by any one of multiple sources (e.g., the active learning component 130 and the quality assurance component 140 of system 100, one-off cleaning tasks, and/or external collections of verified labeled data) for different purposes. The choices of quantity and/or types of sources for possible training data samples are not critical to the invention.
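  • A labeled data reservoir of this kind can be sketched as a simple append-only pool in which each oracle-verified record is tagged with the component that selected it; the class below is an illustrative assumption, not the disclosed data structure.

    from dataclasses import dataclass, field
    from typing import Any, Dict, List

    @dataclass
    class LabeledDataReservoir:
        """Continuously growing pool of oracle-verified training candidates, each
        tagged with its selecting source (e.g., active learning, quality assurance)."""
        records: List[Dict[str, Any]] = field(default_factory=list)

        def add(self, instance: Any, true_label: Any, source: str) -> None:
            self.records.append({"instance": instance, "label": true_label, "source": source})

        def from_source(self, source: str) -> List[Dict[str, Any]]:
            """Return only the records that were selected by a particular source."""
            return [r for r in self.records if r["source"] == source]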
  • In embodiments, the training data manager 170 may update the training data 180 by selecting, from a labeled data reservoir 160, an optimal subset of training data samples to use in a training data 180 update. In embodiments in which the input data instances are multi-dimensional data, the criteria used by the training data manager 170 for selecting the optimal subset of training data samples may be based at least in part on the feature analysis used to generate the initial training data set used to derive the model. In some embodiments, the feature analysis includes clustering collected unlabeled data instances into homogeneous groups across multiple dimensions using an unsupervised learning approach that is dependent on the distribution of the input data as described, for example, in U.S. patent application Ser. No. 14/038,661. In these embodiments, the criteria used by the training data manager 170 may be based on attributes of a single cluster tracked over time to ensure maintenance of model fidelity and/or maintenance of an accurate class balance for classification tasks. In some embodiments, the training data manager 170 selection criteria may be used to update the feature extraction criteria implemented by the input data analysis component 110.
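  • One way to keep the selected subset consistent with the clustered distribution of the input data is sketched below under the assumption of a per-cluster quota; the cluster_of callback and the quota value are hypothetical.

    import random
    from collections import defaultdict
    from typing import Any, Callable, Dict, List

    def select_per_cluster(labeled_records: List[Dict[str, Any]],
                           cluster_of: Callable[[Any], int],
                           per_cluster_quota: int = 20,
                           seed: int = 0) -> List[Dict[str, Any]]:
        """Select an update subset from the reservoir so that each cluster of the
        input distribution remains represented, helping maintain model fidelity."""
        rng = random.Random(seed)
        by_cluster: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
        for record in labeled_records:
            by_cluster[cluster_of(record["instance"])].append(record)
        selected: List[Dict[str, Any]] = []
        for members in by_cluster.values():
            rng.shuffle(members)
            selected.extend(members[:per_cluster_quota])
        return selected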
  • FIG. 2 is an illustration 200 of an example of the different effects of updating an exemplary training data set for a binary classification task using labeled data samples that have been respectively chosen from either active learning or dynamic data quality assessment. A model (i.e., a binary classifier) assigns a predicted judgment value 210 to a data sample; a data sample assigned a judgment value that is close to either 0 or 1 has been determined with certainty by the classifier to belong to one or the other of two classes. A judgment value of 0.5 represents a situation in which the classification decision was not certain (i.e., the predicted judgment is close to the decision boundary 215 for the classification task).
  • The dashed curve 240 represents the relative frequencies of new training data samples that would be added to a training data set for this binary classification problem by an active learning component. To enhance the performance of the classifier in situations where the decision was uncertain, the active learning component would choose the majority of new training data samples from input data that resulted in decisions near the decision boundary 215.
  • The solid curve 230 represents the relative frequencies of new training data samples that would be added to the training data set by dynamic quality assessment. Instead of choosing new training data samples based on the judgment value, in some embodiments, dynamic quality assessment may choose the majority of new training data samples based on whether they statistically belong in the data reservoir distribution. It also may select new training data samples that were classified with certainty (i.e., having a judgment value close to either 0 or 1), but erroneously (e.g., samples in which the predicted judgment from the classifier did not match the result returned from the oracle).
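  • The contrast between the two selection curves of FIG. 2 can be summarized, purely as an illustrative assumption about the direction of each rule, by two predicates over the judgment value; the margin and confidence cut-offs below are hypothetical.

    def active_learning_pick(judgment_value: float, margin: float = 0.1) -> bool:
        """Active learning favors samples whose judgment value lies near the
        decision boundary at 0.5."""
        return abs(judgment_value - 0.5) <= margin

    def quality_assessment_pick(judgment_value: float, true_label: int,
                                fits_reservoir_distribution: bool) -> bool:
        """Dynamic quality assessment is assumed here to favor samples that do not
        statistically fit the reservoir distribution, or that were classified with
        certainty but erroneously."""
        predicted = 1 if judgment_value >= 0.5 else 0
        confident = judgment_value <= 0.1 or judgment_value >= 0.9
        return (not fits_reservoir_distribution) or (confident and predicted != true_label)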
  • FIG. 3 is a flow diagram of an example method 300 for automatic updating of a training data set based on incremental re-training of a predictive model. For convenience, the method 300 will be described with respect to a system that includes one or more computing devices and performs the method 300. Specifically, the method will be described with respect to implementation by training data manager 170 within system 100.
  • In embodiments, after receiving 305 training data and a current model derived using the training data, the system selects 310 a set of labeled data instances from a labeled data reservoir. In some embodiments, the data in the labeled data reservoir are not included in the training data. As previously described, in some embodiments, a labeled data reservoir includes a pool of possible training data that have been collected continuously over time from input data being processed by the model. Each of the data instances in the reservoir has been assigned a true label (e.g., a verified category identifier for a classification task) by a trusted source (i.e., an oracle). In some embodiments, the labeled reservoir data have been collected as a result of having been selected, for different purposes, by one of multiple sources (e.g., the active learning component 130 and the quality assurance component 140 of system 100). For example, referring to the exemplary binary classification task 200, active learning may select possible training data instances from input data in which the predicted judgment is close to the decision boundary (thus providing maximum information about the task to the model), while dynamic quality assessment may select possible training data instances from input data based on a statistical decision.
  • In embodiments, selecting the set of labeled data instances from the labeled data reservoir is based on a determination that re-training the model with updated training data likely will result in improved model performance. In some embodiments, this determination is based at least in part on analyzing the distribution and quality of the training data. For example, in some embodiments in which the predictive model is a classifier, the selection may be based at least in part on maintenance and/or improvement of class balance in the training data (e.g., adding training examples of rare categories). In a second example, the selection may be based at least in part on adding examples having higher data quality than the training data. In a third example, the selection may be based at least in part on adding examples that have higher accuracy assessment scores, as previously described. Additionally and/or alternatively, in some embodiments in which curating training data is implemented within an adaptive dynamic data analysis system (e.g., system 100), this determination is based on feedback signals received from one or more components of the system (e.g., the active learning component 130 and the quality assurance component 140) and/or data freshness (i.e., preferentially adding newer data to a training data set over older data).
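  • For instance, a class-balance trigger of the kind described above might be sketched as follows; the imbalance ratio is an illustrative assumption rather than a disclosed parameter.

    from collections import Counter
    from typing import Hashable, Iterable

    def should_retrain(training_labels: Iterable[Hashable],
                       reservoir_labels: Iterable[Hashable],
                       imbalance_ratio: float = 3.0) -> bool:
        """Heuristic trigger: recommend re-training when the reservoir holds labels
        for classes that are rare (or missing) in the current training data."""
        train_counts = Counter(training_labels)
        reservoir_counts = Counter(reservoir_labels)
        if not train_counts:
            return True
        most_common = max(train_counts.values())
        for label in reservoir_counts:
            if most_common / max(train_counts.get(label, 0), 1) >= imbalance_ratio:
                return True
        return False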
  • In embodiments, the system generates 315 at least one candidate training data set by updating the training data using the set of labeled data instances. In embodiments, updating the training data may include pruning the training data set and replacing removed data with at least a subset of the selected labeled data. In some embodiments, pruning the training data set may include removing outliers from the training data. In some embodiments in which the model is a classifier, removing outliers may be implemented on a per class basis (e.g., removing a training data sample describing a patient who has been classified as having a particular disease but has attributes that are inconsistent with the attributes describing other patients who have been classified as having that disease). Additionally and/or alternatively, updating the training data may include pruning outliers from the selected labeled data before updating the training data.
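  • Per-class outlier pruning of the kind described above might be sketched with a simple per-class z-score rule; the threshold of 3.0 is an illustrative assumption, not a disclosed value.

    import numpy as np

    def prune_outliers_per_class(instances, labels, z_threshold: float = 3.0):
        """Remove training examples whose features deviate strongly from the other
        examples of the same class, using a per-class z-score test."""
        instances = np.asarray(instances, dtype=float)
        labels = np.asarray(labels)
        keep = np.ones(len(labels), dtype=bool)
        for cls in np.unique(labels):
            idx = np.where(labels == cls)[0]
            mean = instances[idx].mean(axis=0)
            std = instances[idx].std(axis=0) + 1e-12
            z = np.abs((instances[idx] - mean) / std)
            keep[idx] = (z < z_threshold).all(axis=1)
        return instances[keep], labels[keep]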
  • In some embodiments, the system derives multiple candidate models, each derived using a differently updated set of the received training data. In some embodiments, each of the differently updated training data sets may represent the current training data having been updated using a different subset of the selected set of labeled data instances. In some embodiments, updating the current training data may be based on a greedy algorithm in which new batches of training data instances are added incrementally to the training data set. Before each batch is added, a test is performed to determine if updating the training data by adding the batch will improve the model performance. Additionally and/or alternatively, in some embodiments, updating the current training data may be based on a non-greedy algorithm in which, for example, all the current training data are removed and replaced with a completely new set of training data.
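  • A minimal sketch of the greedy variant follows, assuming caller-supplied train and evaluate callbacks (both hypothetical; evaluate returns a score where higher is better).

    from typing import Callable, Iterable, List, Sequence, Tuple

    def greedy_update(training_data: Sequence, batches: Iterable[Sequence],
                      train: Callable[[Sequence], object],
                      evaluate: Callable[[object], float]) -> Tuple[List, float]:
        """Greedy curation: add labeled batches one at a time, keeping a batch only
        if re-training on the enlarged set improves the evaluation score."""
        current: List = list(training_data)
        best_score = evaluate(train(current))
        for batch in batches:
            candidate = current + list(batch)
            score = evaluate(train(candidate))
            if score > best_score:
                current, best_score = candidate, score
        return current, best_score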
  • In embodiments, for each candidate updated training data set, the system derives 320 a candidate model using supervised learning. In embodiments, the system generates 325 an assessment of whether the candidate model performance is improved from the current model performance. In some embodiments, generating the assessment includes A/B testing, in which the same set of data is input to the current model and to at least one candidate model that has been trained using candidate training data, and the performance of the candidate model is then compared to the performance of the current model. In some embodiments, comparing the performance of the current model and a candidate model is implemented by cross-validation.
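  • The A/B comparison in the assessment 325 can be sketched, under the assumption of a caller-supplied scoring metric, as feeding the same held-out data to both models and comparing the resulting scores.

    from typing import Callable, Sequence, Tuple

    def ab_compare(current_model, candidate_model,
                   eval_inputs: Sequence, eval_labels: Sequence,
                   score: Callable[[object, Sequence, Sequence], float]) -> Tuple[bool, float, float]:
        """A/B test: evaluate both models on the same held-out data and report
        whether the candidate improves on the current model."""
        current_score = score(current_model, eval_inputs, eval_labels)
        candidate_score = score(candidate_model, eval_inputs, eval_labels)
        return candidate_score > current_score, current_score, candidate_score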
  • There are a variety of well-known statistical techniques for comparing results; the choice of statistical technique for comparing the performance of models is not critical to the invention.
  • In some embodiments in which the model performs real time analysis of input data from a data stream (e.g., embodiments of dynamic data analysis system 100), the input data stream may be forked to multiple models so that A/B testing is implemented in parallel for all the models. In embodiments, the system updates the training data and instantiates a re-trained model in an instance in which the assessment indicates that re-training the current model using the updated training data results in improved model performance.
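  • Forking an input data stream to several candidate models so that they can be assessed in parallel might be sketched as follows; the evaluate callback and the thread-based parallelism are assumptions for illustration, not the disclosed mechanism.

    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable, Iterable, List, Sequence

    def parallel_ab_test(stream: Iterable, models: Sequence,
                         evaluate: Callable[[object, List], float]) -> List[float]:
        """Fork one bounded window of the input stream to every model and evaluate
        all models on the same data in parallel, returning one score per model."""
        window = list(stream)  # materialize one window so each model sees identical input
        with ThreadPoolExecutor(max_workers=max(len(models), 1)) as pool:
            scores = list(pool.map(lambda model: evaluate(model, window), models))
        return scores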
  • In an instance in which the assessment indicates that a re-trained model performance is improved from the current model performance 330, the system instantiates 335 both the candidate model and the candidate updated training data used to derive it before the process ends 340. In some embodiments in which the training data are not included in the labeled data reservoir, instantiating the updated training data includes removing, from the labeled data reservoir, the selected set of labeled data instances used to update the received training data.
  • The process ends 340 in an instance in which the assessment indicates that a candidate model performance is not improved from the current model performance 330.
  • FIG. 4 shows a schematic block diagram of circuitry 400, some or all of which may be included in, for example, adaptive crowd-trained learning framework system 100. As illustrated in FIG. 4, in accordance with some example embodiments, circuitry 400 can include various means, such as processor 402, memory 404, communications module 406, and/or input/output module 408. As referred to herein, “module” includes hardware, software and/or firmware configured to perform one or more particular functions. In this regard, the means of circuitry 400 as described herein may be embodied as, for example, circuitry, hardware elements (e.g., a suitably programmed processor, combinational logic circuit, and/or the like), a computer program product comprising computer-readable program instructions stored on a non-transitory computer-readable medium (e.g., memory 404) that is executable by a suitably configured processing device (e.g., processor 402), or some combination thereof.
  • Processor 402 may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), or some combination thereof. Accordingly, although illustrated in FIG. 4 as a single processor, in some embodiments processor 402 comprises a plurality of processors. The plurality of processors may be embodied on a single computing device or may be distributed across a plurality of computing devices collectively configured to function as circuitry 400. The plurality of processors may be in operative communication with each other and may be collectively configured to perform one or more functionalities of circuitry 400 as described herein. In an example embodiment, processor 402 is configured to execute instructions stored in memory 404 or otherwise accessible to processor 402. These instructions, when executed by processor 402, may cause circuitry 400 to perform one or more of the functionalities of circuitry 400 as described herein.
  • Whether configured by hardware, firmware/software methods, or by a combination thereof, processor 402 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly. Thus, for example, when processor 402 is embodied as an ASIC, FPGA or the like, processor 402 may comprise specifically configured hardware for conducting one or more operations described herein. Alternatively, as another example, when processor 402 is embodied as an executor of instructions, such as may be stored in memory 404, the instructions may specifically configure processor 402 to perform one or more algorithms and operations described herein, such as those discussed in connection with FIG. 3.
  • Memory 404 may comprise, for example, volatile memory, non-volatile memory, or some combination thereof. Although illustrated in FIG. 4 as a single memory, memory 404 may comprise a plurality of memory components. The plurality of memory components may be embodied on a single computing device or distributed across a plurality of computing devices. In various embodiments, memory 404 may comprise, for example, a hard disk, random access memory, cache memory, flash memory, a compact disc read only memory (CD-ROM), digital versatile disc read only memory (DVD-ROM), an optical disc, circuitry configured to store information, or some combination thereof. Memory 404 may be configured to store information, data (including analytics data), applications, instructions, or the like for enabling circuitry 400 to carry out various functions in accordance with example embodiments of the present invention. For example, in at least some embodiments, memory 404 is configured to buffer input data for processing by processor 402. Additionally or alternatively, in at least some embodiments, memory 404 is configured to store program instructions for execution by processor 402. Memory 404 may store information in the form of static and/or dynamic information. This stored information may be stored and/or used by circuitry 400 during the course of performing its functionalities.
  • Communications module 406 may be embodied as any device or means embodied in circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (e.g., memory 404) and executed by a processing device (e.g., processor 402), or a combination thereof that is configured to receive and/or transmit data from/to another device, such as, for example, a second circuitry 400 and/or the like. In some embodiments, communications module 406 (like other components discussed herein) can be at least partially embodied as or otherwise controlled by processor 402. In this regard, communications module 406 may be in communication with processor 402, such as via a bus. Communications module 406 may include, for example, an antenna, a transmitter, a receiver, a transceiver, network interface card and/or supporting hardware and/or firmware/software for enabling communications with another computing device. Communications module 406 may be configured to receive and/or transmit any data that may be stored by memory 404 using any protocol that may be used for communications between computing devices. Communications module 406 may additionally or alternatively be in communication with the memory 404, input/output module 408 and/or any other component of circuitry 400, such as via a bus.
  • Input/output module 408 may be in communication with processor 402 to receive an indication of a user input and/or to provide an audible, visual, mechanical, or other output to a user. Some example visual outputs that may be provided to a user by circuitry 400 are discussed in connection with FIG. 1. As such, input/output module 408 may include support, for example, for a keyboard, a mouse, a joystick, a display, a touch screen display, a microphone, a speaker, an RFID reader, a barcode reader, a biometric scanner, and/or other input/output mechanisms. In embodiments wherein circuitry 400 is embodied as a server or database, aspects of input/output module 408 may be reduced as compared to embodiments where circuitry 400 is implemented as an end-user machine or other type of device designed for complex user interactions. In some embodiments, input/output module 408 (like other components discussed herein) may even be eliminated from circuitry 400. Alternatively, such as in embodiments wherein circuitry 400 is embodied as a server or database, at least some aspects of input/output module 408 may be embodied on an apparatus used by a user that is in communication with circuitry 400. Input/output module 408 may be in communication with the memory 404, communications module 406, and/or any other component(s), such as via a bus. Although more than one input/output module and/or other component can be included in circuitry 400, only one is shown in FIG. 4 to avoid overcomplicating the drawing (like the other components discussed herein).
  • Training data manager module 410 may also or instead be included and configured to perform the functionality discussed herein related to the training data curation discussed above. In some embodiments, some or all of the functionality of training data curation may be performed by processor 402. In this regard, the example processes and algorithms discussed herein can be performed by at least one processor 402 and/or training data manager module 410. For example, non-transitory computer readable media can be configured to store firmware, one or more application programs, and/or other software, which include instructions and other computer-readable program code portions that can be executed to control each processor (e.g., processor 402 and/or training data manager module 410) of the components of system 100 to implement various operations, including the examples shown above. As such, a series of computer-readable program code portions are embodied in one or more computer program products and can be used, with a computing device, server, and/or other programmable apparatus, to produce machine-implemented processes.
  • Any such computer program instructions and/or other type of code may be loaded onto a computer, processor or other programmable apparatus's circuitry to produce a machine, such that the computer, processor, or other programmable circuitry that executes the code on the machine creates the means for implementing various functions, including those described herein.
  • It is also noted that all or some of the information presented by the example displays discussed herein can be based on data that is received, generated and/or maintained by one or more components of system 100. In some embodiments, one or more external systems (such as a remote cloud computing and/or data storage system) may also be leveraged to provide at least some of the functionality discussed herein.
  • As described above in this disclosure, aspects of embodiments of the present invention may be configured as methods, mobile devices, backend network devices, and the like.
  • Accordingly, embodiments may comprise various means including entirely of hardware or any combination of software and hardware. Furthermore, embodiments may take the form of a computer program product on at least one non-transitory computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium. Any suitable computer-readable storage medium may be utilized including non-transitory hard disks, CD-ROMs, flash memory, optical storage devices, or magnetic storage devices.
  • Embodiments of the present invention have been described above with reference to block diagrams and flowchart illustrations of methods, apparatuses, systems and computer program products. It will be understood that each block of the circuit diagrams and process flow diagrams, and combinations of blocks in the circuit diagrams and process flowcharts, respectively, can be implemented by various means including computer program instructions. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus, such as processor 402 and/or training data manager module 410 discussed above with reference to FIG. 4, to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create a means for implementing the functions specified in the flowchart block or blocks.
  • These computer program instructions may also be stored in a computer-readable storage device (e.g., memory 404) that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage device produce an article of manufacture including computer-readable instructions for implementing the function discussed herein. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions discussed herein.
  • Accordingly, blocks of the block diagrams and flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the circuit diagrams and process flowcharts, and combinations of blocks in the circuit diagrams and process flowcharts, can be implemented by special purpose hardware-based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
  • Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (24)

1. A computer-implemented method for adaptively improving the performance of a current predictive model, the method comprising:
selecting a set of labeled data instances from a labeled data reservoir, the labeled data reservoir comprising a pool of possible training data, wherein selecting the set of labeled data instances is based on a determination that re-training a current predictive model with updated training data likely will result in improved model performance;
generating a candidate model using at least one candidate training data set generated based at least in part on a received training data set and the set of labeled data instances;
in an instance in which a performance of the candidate model is improved from a performance of the current predictive model,
instantiating the candidate training data set and the candidate model.
2-13. (canceled)
14. A computer program product, stored on a non-transitory computer readable medium, comprising instructions that when executed on one or more computers cause the one or more computers to perform operations comprising:
selecting a set of labeled data instances from a labeled data reservoir, the labeled data reservoir comprising a pool of possible training data, wherein selecting the set of labeled data instances is based on a determination that re-training a current predictive model with updated training data likely will result in improved model performance;
generating a candidate model using at least one candidate training data set generated based at least in part on a received training data set and the set of labeled data instances;
in an instance in which a performance of the candidate model is improved from a performance of the current predictive model,
instantiating the candidate training data set and the candidate model.
15. The computer program product of claim 14, wherein the labeled data reservoir comprises data that have been collected continuously over time from input data being processed by the current predictive model.
16. The computer program product of claim 14,
wherein the set of labeled data instances is not included in the received training data set, and wherein each labeled data instance is associated with a true label representing the data instance.
17. The computer program product of claim 16, wherein the determination is based at least in part on a distribution and quality of the training data set.
18. The computer program product of claim 17, wherein the current predictive model is a classifier predicting to which of a set of predictive categories an input data instance belongs, wherein a true label associated with a labeled data instance identifies the predictive category to which the labeled data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir is based at least in part on maintaining a class balance within the training data.
19. The computer program product of claim 14, wherein generating the candidate training data comprises identifying and removing outlier instances.
20. The computer program product of claim 19, wherein the current predictive model is a classifier predicting to which of a set of predictive categories an input data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir comprises identifying and removing outlier instances in one predictive category.
21. The computer program product of claim 14, wherein the labeled data reservoir comprises labeled data instances that are received from multiple sources, and wherein selecting a labeled data instance from the set of labeled data instances comprises:
selecting the labeled data instance in an instance in which a source of the labeled data instance matches with a pre-determined source.
22. The computer program product of claim 14, wherein generating at least one candidate training data set is based on a greedy algorithm, the generating comprising:
generating a first candidate training data set by adding a first subset of the labeled data instances to the training data; and
generating a second candidate training data set by adding a second subset of the labeled data instances to the first candidate training data set.
23. The computer program product of claim 14, wherein generating at least one candidate training data set is based on a non-greedy algorithm, the generating comprising:
replacing the training data with a subset of the labeled data instances.
24. (canceled)
25. The computer program product of claim 24, wherein generating the assessment comprises calculating a cross-validation between the candidate model performance and the current predictive model performance.
26. The computer program product of claim 24, wherein there are multiple candidate models, and wherein generating the assessment for each of the multiple candidate models is implemented in parallel.
27. A system, comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
selecting a set of labeled data instances from a labeled data reservoir, the labeled data reservoir comprising a pool of possible training data, wherein selecting the set of labeled data instances is based on a determination that re-training a current predictive model with updated training data likely will result in improved model performance;
generating a candidate model using at least one candidate training data set generated based at least in part on a received training data set and the set of labeled data instances;
in an instance in which a performance of the candidate model is improved from a performance of the current predictive model,
instantiating the candidate training data set and the candidate model.
28. The system of claim 27, wherein the labeled data reservoir comprises data that have been collected continuously over time from input data being processed by the current predictive model.
29. The system of claim 27,
wherein the set of labeled data instances is not included in the received training data set, and wherein each labeled data instance is associated with a true label representing the data instance.
30. The system of claim 29, wherein the determination is based at least in part on a distribution and quality of the training data.
31. The system of claim 30, wherein the current predictive model is a classifier predicting to which of a set of predictive categories an input data instance belongs, wherein a true label associated with a labeled data instance identifies the predictive category to which the labeled data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir is based at least in part on maintaining a class balance within the training data.
32. The system of claim 27, wherein generating the candidate training data comprises identifying and removing outlier instances.
33. The system of claim 32, wherein the current predictive model is a classifier predicting to which of a set of predictive categories an input data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir comprises identifying and removing outlier instances in one predictive category.
34. The system of claim 27, wherein the labeled data reservoir comprises labeled data instances that are received from multiple sources, and wherein selecting a labeled data instance from the set of labeled data instances comprises: selecting the labeled data instance in an instance in which a source of the labeled data instance matches with a pre-determined source.
35-39. (canceled)
US16/418,232 2014-10-28 2019-05-21 Curating Training Data For Incremental Re-Training Of A Predictive Model Abandoned US20200012963A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/418,232 US20200012963A1 (en) 2014-10-28 2019-05-21 Curating Training Data For Incremental Re-Training Of A Predictive Model

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201462069692P 2014-10-28 2014-10-28
US14/918,362 US10339468B1 (en) 2014-10-28 2015-10-20 Curating training data for incremental re-training of a predictive model
US16/418,232 US20200012963A1 (en) 2014-10-28 2019-05-21 Curating Training Data For Incremental Re-Training Of A Predictive Model

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/918,362 Continuation US10339468B1 (en) 2014-10-28 2015-10-20 Curating training data for incremental re-training of a predictive model

Publications (1)

Publication Number Publication Date
US20200012963A1 true US20200012963A1 (en) 2020-01-09

Family

ID=67069365

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/918,362 Active 2037-07-23 US10339468B1 (en) 2014-10-28 2015-10-20 Curating training data for incremental re-training of a predictive model
US16/418,232 Abandoned US20200012963A1 (en) 2014-10-28 2019-05-21 Curating Training Data For Incremental Re-Training Of A Predictive Model

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/918,362 Active 2037-07-23 US10339468B1 (en) 2014-10-28 2015-10-20 Curating training data for incremental re-training of a predictive model

Country Status (1)

Country Link
US (2) US10339468B1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132291A (en) * 2020-08-21 2020-12-25 北京艾巴斯智能科技发展有限公司 Intelligent brain optimization method and device applied to government affair system, storage medium and terminal
US11061790B2 (en) * 2019-01-07 2021-07-13 International Business Machines Corporation Providing insight of continuous delivery pipeline using machine learning
US20210224687A1 (en) * 2020-01-17 2021-07-22 Apple Inc. Automated input-data monitoring to dynamically adapt machine-learning techniques
US20220207059A1 (en) * 2020-12-29 2022-06-30 Landmark Graphics Corporation Smart data warehouse for cloud-based reservoir simulation
US20220318654A1 (en) * 2021-03-30 2022-10-06 Paypal, Inc. Machine Learning and Reject Inference Techniques Utilizing Attributes of Unlabeled Data Samples
WO2023096968A1 (en) * 2021-11-23 2023-06-01 Strong Force Tp Portfolio 2022, Llc Intelligent transportation methods and systems
US20240320206A1 (en) * 2023-03-24 2024-09-26 Gm Cruise Holdings Llc Identifying quality of labeled data
US20240328803A1 (en) * 2021-04-28 2024-10-03 Tomtom Navigation B.V. Methods and Systems for Determining Estimated Travel Times Through a Navigable Network
US12154391B2 (en) 2018-09-30 2024-11-26 Strong Force Tp Portfolio 2022, Llc Intelligent transportation systems including digital twin interface for a passenger vehicle
EP4507260A1 (en) * 2023-08-08 2025-02-12 Nokia Solutions and Networks Oy Apparatuses and methods for selecting candidates end points in a fixed communication network
US12380746B2 (en) 2018-09-30 2025-08-05 Strong Force Tp Portfolio 2022, Llc Digital twin systems and methods for transportation systems

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10949889B2 (en) 2016-01-04 2021-03-16 Exelate Media Ltd. Methods and apparatus for managing models for classification of online users
US10699184B2 (en) * 2016-12-29 2020-06-30 Facebook, Inc. Updating predictions for a deep-learning model
US20180330272A1 (en) * 2017-05-09 2018-11-15 Microsoft Technology Licensing, Llc Method of Adding Classes to Classifier
US11410073B1 (en) * 2017-05-31 2022-08-09 The Mathworks, Inc. Systems and methods for robust feature selection
US10803105B1 (en) 2017-08-03 2020-10-13 Tamr, Inc. Computer-implemented method for performing hierarchical classification
US11663517B2 (en) * 2017-11-03 2023-05-30 Salesforce, Inc. Automatic machine learning model generation
US10936704B2 (en) * 2018-02-21 2021-03-02 International Business Machines Corporation Stolen machine learning model identification
US11436527B2 (en) * 2018-06-01 2022-09-06 Nami Ml Inc. Machine learning at edge devices based on distributed feedback
US11164099B2 (en) * 2019-02-19 2021-11-02 International Business Machines Corporation Quantum space distance estimation for classifier training using hybrid classical-quantum computing system
US11611569B2 (en) * 2019-05-31 2023-03-21 Micro Focus Llc Machine learning-based network device profiling
US20210064984A1 (en) * 2019-08-29 2021-03-04 Sap Se Engagement prediction using machine learning in digital workplace
CN110689135B (en) * 2019-09-05 2022-10-11 第四范式(北京)技术有限公司 Anti-money laundering model training method and device and electronic equipment
JP7276487B2 (en) * 2019-10-23 2023-05-18 富士通株式会社 Creation method, creation program and information processing device
CN111177136B (en) * 2019-12-27 2023-04-18 上海依图网络科技有限公司 Device and method for washing label data
US20210256369A1 (en) * 2020-02-18 2021-08-19 SparkCognition, Inc. Domain-adapted classifier generation
US11455534B2 (en) 2020-06-09 2022-09-27 Macronix International Co., Ltd. Data set cleaning for artificial neural network training
CN111652327A (en) * 2020-07-16 2020-09-11 北京思图场景数据科技服务有限公司 A model iteration method, system and computer equipment
US12130815B2 (en) * 2020-08-11 2024-10-29 Castellum.AI System and method for processing data for electronic searching
CN112269794B (en) * 2020-09-16 2024-09-17 连尚(新昌)网络科技有限公司 A method and device for predicting violations based on blockchain
CN112308237B (en) * 2020-10-30 2023-09-26 平安科技(深圳)有限公司 Question-answer data enhancement method and device, computer equipment and storage medium
US12288139B2 (en) * 2020-12-02 2025-04-29 Sap Se Iterative machine learning and relearning
CN113505805B (en) * 2021-05-25 2023-10-13 平安银行股份有限公司 Sample data closed-loop generation method, device, equipment and storage medium
US20230064674A1 (en) * 2021-08-31 2023-03-02 International Business Machines Corporation Iterative training of computer model for machine learning
US20240379209A1 (en) * 2023-05-09 2024-11-14 Apple Inc. Automated schedule generation with content filling
CN116805157B (en) * 2023-08-25 2023-11-17 中国人民解放军国防科技大学 Unmanned cluster autonomous dynamic assessment method and device
CN118781470B (en) * 2024-09-12 2025-02-14 深圳市傲天科技股份有限公司 Image processing model training method, device, equipment, storage medium and computer program product

Family Cites Families (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5212765A (en) 1990-08-03 1993-05-18 E. I. Du Pont De Nemours & Co., Inc. On-line training neural network system for process control
US5825646A (en) 1993-03-02 1998-10-20 Pavilion Technologies, Inc. Method and apparatus for determining the sensitivity of inputs to a neural network on output parameters
US8311673B2 (en) 1996-05-06 2012-11-13 Rockwell Automation Technologies, Inc. Method and apparatus for minimizing error in dynamic and steady-state processes for prediction, control, and optimization
US6092072A (en) 1998-04-07 2000-07-18 Lucent Technologies, Inc. Programmed medium for clustering large databases
US7418431B1 (en) 1999-09-30 2008-08-26 Fair Isaac Corporation Webstation: configurable web-based workstation for reason driven data analysis
US20030130899A1 (en) 2002-01-08 2003-07-10 Bruce Ferguson System and method for historical database training of non-linear models for use in electronic commerce
US20040205482A1 (en) 2002-01-24 2004-10-14 International Business Machines Corporation Method and apparatus for active annotation of multimedia content
US7565304B2 (en) * 2002-06-21 2009-07-21 Hewlett-Packard Development Company, L.P. Business processes based on a predictive model
US20040049473A1 (en) 2002-09-05 2004-03-11 David John Gower Information analytics systems and methods
US7490071B2 (en) 2003-08-29 2009-02-10 Oracle Corporation Support vector machines processing system
US7512582B2 (en) 2003-12-10 2009-03-31 Microsoft Corporation Uncertainty reduction in collaborative bootstrapping
US7480640B1 (en) * 2003-12-16 2009-01-20 Quantum Leap Research, Inc. Automated method and system for generating models from data
US7599897B2 (en) 2006-05-05 2009-10-06 Rockwell Automation Technologies, Inc. Training a support vector machine with process constraints
US7925620B1 (en) 2006-08-04 2011-04-12 Hyoungsoo Yoon Contact information management
US7756800B2 (en) 2006-12-14 2010-07-13 Xerox Corporation Method for transforming data elements within a classification system based in part on input from a human annotator/expert
WO2008137978A1 (en) 2007-05-08 2008-11-13 The University Of Vermont And State Agricultural College Systems and methods for reservoir sampling of streaming data and stream joins
WO2009061399A1 (en) 2007-11-05 2009-05-14 Nagaraju Bandaru Method for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis
US8799773B2 (en) 2008-01-25 2014-08-05 Google Inc. Aspect-based sentiment summarization
US20110191335A1 (en) 2010-01-29 2011-08-04 Lexisnexis Risk Data Management Inc. Method and system for conducting legal research using clustering analytics
US9165051B2 (en) 2010-08-24 2015-10-20 Board Of Trustees Of The University Of Illinois Systems and methods for detecting a novel data class
US9390194B2 (en) 2010-08-31 2016-07-12 International Business Machines Corporation Multi-faceted visualization of rich text corpora
US8402543B1 (en) * 2011-03-25 2013-03-19 Narus, Inc. Machine learning based botnet detection with dynamic adaptation
US20120254186A1 (en) 2011-03-31 2012-10-04 Nokia Corporation Method and apparatus for rendering categorized location-based search results
US8533224B2 (en) 2011-05-04 2013-09-10 Google Inc. Assessing accuracy of trained predictive models
US8843427B1 (en) 2011-07-01 2014-09-23 Google Inc. Predictive modeling accuracy
US9349103B2 (en) 2012-01-09 2016-05-24 DecisionQ Corporation Application of machine learned Bayesian networks to detection of anomalies in complex systems
US9015080B2 (en) 2012-03-16 2015-04-21 Orbis Technologies, Inc. Systems and methods for semantic inference and reasoning
US9715723B2 (en) 2012-04-19 2017-07-25 Applied Materials Israel Ltd Optimization of unknown defect rejection for automatic defect classification
US20140101544A1 (en) 2012-10-08 2014-04-10 Microsoft Corporation Displaying information according to selected entity type
US9501799B2 (en) 2012-11-08 2016-11-22 Hartford Fire Insurance Company System and method for determination of insurance classification of entities
US20140172767A1 (en) 2012-12-14 2014-06-19 Microsoft Corporation Budget optimal crowdsourcing
US20140279745A1 (en) 2013-03-14 2014-09-18 Sm4rt Predictive Systems Classification based on prediction of accuracy of multiple data models
US9122681B2 (en) 2013-03-15 2015-09-01 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US9390378B2 (en) 2013-03-28 2016-07-12 Wal-Mart Stores, Inc. System and method for high accuracy product classification with limited supervision
US9600776B1 (en) 2013-11-22 2017-03-21 Groupon, Inc. Automated adaptive data analysis using dynamic data quality assessment
US9652362B2 (en) 2013-12-06 2017-05-16 Qualcomm Incorporated Methods and systems of using application-specific and application-type-specific models for the efficient classification of mobile device behaviors
US9721212B2 (en) 2014-06-04 2017-08-01 Qualcomm Incorporated Efficient on-device binary analysis for auto-generated behavioral models
CA2951723C (en) 2014-06-10 2021-04-27 Sightline Innovation Inc. System and method for network based application development and implementation
US10402746B2 (en) 2014-09-10 2019-09-03 Amazon Technologies, Inc. Computing instance launch time

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12154391B2 (en) 2018-09-30 2024-11-26 Strong Force Tp Portfolio 2022, Llc Intelligent transportation systems including digital twin interface for a passenger vehicle
US12380746B2 (en) 2018-09-30 2025-08-05 Strong Force Tp Portfolio 2022, Llc Digital twin systems and methods for transportation systems
US12283140B2 (en) 2018-09-30 2025-04-22 Strong Force Tp Portfolio 2022, Llc Vehicle dynamics control using deep learning to update an operational parameter of a vehicle drive train
US12169987B2 (en) 2018-09-30 2024-12-17 Strong Force Tp Portfolio 2022, Llc Intelligent transportation systems including digital twin interface for a passenger vehicle
US11061790B2 (en) * 2019-01-07 2021-07-13 International Business Machines Corporation Providing insight of continuous delivery pipeline using machine learning
US11061791B2 (en) * 2019-01-07 2021-07-13 International Business Machines Corporation Providing insight of continuous delivery pipeline using machine learning
US12020133B2 (en) * 2020-01-17 2024-06-25 Apple Inc. Automated input-data monitoring to dynamically adapt machine-learning techniques
US20240338612A1 (en) * 2020-01-17 2024-10-10 Apple Inc. Automated input-data monitoring to dynamically adapt machine-learning techniques
US12430590B2 (en) * 2020-01-17 2025-09-30 Apple Inc. Automated input-data monitoring to dynamically adapt machine-learning techniques
US20210224687A1 (en) * 2020-01-17 2021-07-22 Apple Inc. Automated input-data monitoring to dynamically adapt machine-learning techniques
US20230124380A1 (en) * 2020-01-17 2023-04-20 Apple Inc. Automated input-data monitoring to dynamically adapt machine-learning techniques
US11562297B2 (en) * 2020-01-17 2023-01-24 Apple Inc. Automated input-data monitoring to dynamically adapt machine-learning techniques
CN112132291A (en) * 2020-08-21 2020-12-25 北京艾巴斯智能科技发展有限公司 Intelligent brain optimization method and device applied to government affair system, storage medium and terminal
US20220207059A1 (en) * 2020-12-29 2022-06-30 Landmark Graphics Corporation Smart data warehouse for cloud-based reservoir simulation
US11797577B2 (en) * 2020-12-29 2023-10-24 Landmark Graphics Corporation Smart data warehouse for cloud-based reservoir simulation
US20220318654A1 (en) * 2021-03-30 2022-10-06 Paypal, Inc. Machine Learning and Reject Inference Techniques Utilizing Attributes of Unlabeled Data Samples
US20240328803A1 (en) * 2021-04-28 2024-10-03 Tomtom Navigation B.V. Methods and Systems for Determining Estimated Travel Times Through a Navigable Network
WO2023096968A1 (en) * 2021-11-23 2023-06-01 Strong Force Tp Portfolio 2022, Llc Intelligent transportation methods and systems
US20240320206A1 (en) * 2023-03-24 2024-09-26 Gm Cruise Holdings Llc Identifying quality of labeled data
EP4507260A1 (en) * 2023-08-08 2025-02-12 Nokia Solutions and Networks Oy Apparatuses and methods for selecting candidates end points in a fixed communication network

Also Published As

Publication number Publication date
US10339468B1 (en) 2019-07-02

Similar Documents

Publication Publication Date Title
US20200012963A1 (en) Curating Training Data For Incremental Re-Training Of A Predictive Model
US20190378044A1 (en) Processing dynamic data within an adaptive oracle-trained learning system using curated training data for incremental re-training of a predictive model
US12045732B2 (en) Automated dynamic data quality assessment
US20200302337A1 (en) Automatic selection of high quality training data using an adaptive oracle-trained learning framework
US11016996B2 (en) Dynamic clustering for streaming data
US20190354810A1 (en) Active learning to reduce noise in labels
US20200293951A1 (en) Dynamically optimizing a data set distribution
CN108830329B (en) Picture processing method and device
US20220180250A1 (en) Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization
US11270210B2 (en) Outlier discovery system selection
CN111027707B (en) Model optimization method, device and electronic equipment
US20230334119A1 (en) Systems and techniques to monitor text data quality
US20160055496A1 (en) Churn prediction based on existing event data
CN115349129A (en) Generating performance predictions with uncertainty intervals
EP4254280A1 (en) Dynamically updated ensemble-based machine learning for streaming data
US20250238797A1 (en) Parsing event data for clustering and classification

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: JPMORGAN CHASE BANK, N.A., ILLINOIS

Free format text: SECURITY INTEREST;ASSIGNORS:GROUPON, INC.;LIVINGSOCIAL, LLC;REEL/FRAME:053294/0495

Effective date: 20200717

AS Assignment

Owner name: GROUPON, INC., ILLINOIS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JEFFERY, SHAWN RYAN;JOHNSTON, DAVID ALAN;SIGNING DATES FROM 20151221 TO 20190507;REEL/FRAME:053539/0683

AS Assignment

Owner name: GROUPON, INC., ILLINOIS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:POLYCHRONOPOULOS, VASILEIOS;REEL/FRAME:053548/0027

Effective date: 20130613

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

AS Assignment

Owner name: LIVINGSOCIAL, LLC (F/K/A LIVINGSOCIAL, INC.), ILLINOIS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:JPMORGAN CHASE BANK, N.A.;REEL/FRAME:066676/0001

Effective date: 20240212

Owner name: GROUPON, INC., ILLINOIS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:JPMORGAN CHASE BANK, N.A.;REEL/FRAME:066676/0001

Effective date: 20240212

Owner name: LIVINGSOCIAL, LLC (F/K/A LIVINGSOCIAL, INC.), ILLINOIS

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN INTELLECTUAL PROPERTY RIGHTS;ASSIGNOR:JPMORGAN CHASE BANK, N.A.;REEL/FRAME:066676/0251

Effective date: 20240212

Owner name: GROUPON, INC., ILLINOIS

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN INTELLECTUAL PROPERTY RIGHTS;ASSIGNOR:JPMORGAN CHASE BANK, N.A.;REEL/FRAME:066676/0251

Effective date: 20240212

Owner name: GROUPON, INC., ILLINOIS

Free format text: RELEASE OF SECURITY INTEREST;ASSIGNOR:JPMORGAN CHASE BANK, N.A.;REEL/FRAME:066676/0001

Effective date: 20240212

Owner name: LIVINGSOCIAL, LLC (F/K/A LIVINGSOCIAL, INC.), ILLINOIS

Free format text: RELEASE OF SECURITY INTEREST;ASSIGNOR:JPMORGAN CHASE BANK, N.A.;REEL/FRAME:066676/0001

Effective date: 20240212

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: BYTEDANCE INC., DELAWARE

Free format text: ASSIGNMENT OF ASSIGNOR'S INTEREST;ASSIGNORS:GROUPON, INC.;LIVINGSOCIAL, INC.;REEL/FRAME:071769/0488

Effective date: 20240226

Owner name: BYTEDANCE INC., DELAWARE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GROUPON, INC.;LIVINGSOCIAL, INC.;REEL/FRAME:071769/0488

Effective date: 20240226