WO2025080778A1 - Dynamic visualization for the performance evaluation of machine learning models and any other type of predictive model
- Publication number
- WO2025080778A1 (PCT/US2024/050685)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- test case
- performance metric
- case data
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Definitions
- Machine learning also known as modeling, predictive modeling, predictive analytics, or predictive artificial intelligence (AI)
- Predictive models also known as “models”
- models generate scores (also known as “predictive scores” or “model scores”) for individuals (e.g., customers, patients, transactions) to estimate the likelihood of a specific outcome, behavior, or category.
- scores can pertain to a wide range of predictions, such as customer behavior, medical diagnoses, or fraud detection. Scores are typically formulated as probabilities, but do not need to be.
- a model’s scores can pertain to, for example, whether the individual will click, buy, lie, die, cancel as a customer, fail to pay a debt, submit high claims as an insurance policyholder, exhibit a certain health outcome as a healthcare patient, or belong to a category — such as spam (versus not spam), holding a certain medical diagnosis (versus not holding the diagnosis), depicting an image of a certain object such as a traffic light or a type of animal (versus not depicting that object), or being a fraudulent transaction (versus being an authorized transaction).
- In the case of models that score whether an individual belongs to a category, the model is sometimes said to “detect,” although the word “predict” also still pertains and is commonly used for such projects, e.g., “the model predicts whether each transaction is fraudulent.” In all cases, the model is placing odds on an unknown, whether that unknown is a future event or outcome or a currently-held category membership.
- a universal, model-agnostic computer-implemented system dynamically evaluates and visualizes the performance of any model, such as a machine learning model, regardless of its type or implementation.
- the system receives test set data containing model score outputs and actual dependent variable values for a plurality of test cases.
- the system then computes performance metric values based on this data using a specified performance metric with adjustable parameters.
- a key innovative aspect is the system’s ability to display and update visual representations of these performance metrics in real-time, allowing users to interactively explore and analyze model performance. Users can adjust parameters through an intuitive interface, and the system immediately updates the performance metric calculations and visual outputs to reflect these changes.
- This dynamic, interactive approach enables users to evaluate models using both technical and business metrics, compare multiple models side-by-side, optimize the setting of the decision threshold in accordance with multiple, competing metrics, optimize the modeling process for specific use cases, and assess model performance across different scenarios and thresholds.
- the system's model-agnostic design allows it to work with any type of model, from neural networks to decision trees to hand-written models, providing a consistent evaluation framework across an organization's diverse modeling projects.
- This universality combined with its extensible architecture that allows users to define custom performance metrics and visualizations, makes the system highly scalable across various modeling use cases and industries.
- FIG. 2 is a flowchart of a method performed by the system of FIG. 1 according to one embodiment of the present invention.
- FIGS. 3A-3D are illustrations of graphical user interfaces for displaying model performance metrics and enabling user manipulation of parameters that affect model performance metric values according to various embodiments of the present invention.
- Embodiments of the present invention address challenges in evaluating and deploying models, such as machine learning models, across various industries and use cases. As organizations increasingly rely on predictive models to improve their operations, there is a growing need for comprehensive, flexible, interactive, and user-friendly tools to assess both technical model performance and the business value of models. Embodiments of the present invention address these needs.
- embodiments of the present invention provide a computer-implemented method for evaluating the performance of models.
- the method includes: identifying a model and specifying a performance metric with adjustable parameters; receiving test set data that includes model scores and actual outcomes for a set of test cases; computing performance metric values based on the test set data and the specified metric; displaying visual output representing these performance metric values; and allowing a user to adjust metric parameters and immediately see updated results.
- any reference herein to a “score” or “model score” without a qualifier (such as “machine learning”) refers to a predictive score, which itself refers to any score that is output by a model of any kind.
- One insight on which embodiments of the present invention are based is that the calculation of performance metrics, including both technical metrics and business metrics, is based on a model's scores, not on the internal workings of a model. Metrics reflect how well a model works - not how a model works. Although model training relates directly to the model's internal mechanics, model evaluation does not. Embodiments of the present invention take advantage of this insight to evaluate the performance of models without requiring access to or knowledge of how those models work.
- business metrics are also vital to the step of evaluating a model after its deployment, in order to monitor its performance and reassess the way in which it is deployed.
- evaluating models in terms of business metrics is also vital in order to evaluate an updated or refreshed model that has been trained over more recent data to serve as a candidate replacement model.
- Model development may be a semi-automatic process that involves a repeated train-and-test cycle.
- the training step may be automatic, using a machine learning method to generate a model.
- the model is evaluated by a human data scientist so that he or she can consider potential alterations to its training process, including expanding or otherwise improving the training data, the choice of machine learning method, and the parameter settings that control the machine learning method.
- the evaluation step of this cycle is typically conducted only in terms of technical metrics. This severely limits the potential business value of the model, since only business metrics can inform the navigation of model development toward business value. You cannot know whether you’re headed in the right direction unless you measure your progress toward the desired destination. Since the aim is business value, business metrics must be calculated.
- In contrast to calculating technical metrics, which abstract a model’s predictive performance away from the manner in which the model’s outputted predictive scores may be utilized, calculating business metrics must involve particulars of the business context in which the model’s predictive scores will be utilized, and also the specific way in which the scores will be utilized. This is because a business metric estimates the business value of a model in its intended usage. Therefore, a business metric’s value often depends in part on parameters that are subject to change or uncertainty. For example, to calculate the savings (a business metric) of a fraud detection predictive model, the false positive cost and false negative cost must be established. Even after these parameters are estimated, they remain subject to change and/or uncertainty.
- a change to them will alter the shape of a savings curve chart, which plots the savings against the confidence threshold.
- the shape of that curve informs various pragmatic tradeoffs that can only be decided on by human decision makers and stakeholders. Since that shape may be altered by the value of these and other parameters, it is critical to provide a means to visualize the effects of altering the parameters on the shape of that curve.
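- For concreteness, one plausible form of such a savings metric is sketched below; this is an illustration only, assuming a per-case false negative cost $c_{FN}$, a per-case false positive cost $c_{FP}$, and true and false positive counts $TP(t)$ and $FP(t)$ at confidence threshold $t$:

$$\mathrm{savings}(t) = TP(t)\cdot c_{FN} \;-\; FP(t)\cdot c_{FP}$$

- Under this sketch, each correctly flagged case avoids the loss $c_{FN}$ and each false alarm incurs $c_{FP}$; changing either cost parameter therefore re-shapes the entire savings curve plotted against $t$.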
- the method can work with any type of model, from neural networks to decision trees to hand-written models, making it universally applicable across different predictive modeling approaches.
- embodiments of the present invention bridge the gap between model performance and organizational objectives. Users can dynamically adjust parameters, such as decision thresholds, and immediately visualize the impact on model performance, facilitating better decision-making about deployment strategies.
- the visual output and interactive elements make complex model evaluation concepts more understandable to non-technical stakeholders, addressing a common communication barrier in machine learning projects.
- the method’s design allows for the incorporation of custom metrics and visualizations, enabling it to adapt to diverse use cases and organizational needs.
- embodiments of the present invention aim to improve the success rate of deploying machine learning models and other models.
- Such embodiments empower organizations to make more informed decisions about model suitability, optimize deployment strategies, facilitate decision-maker deployment authorization by communicating a model’s value in terms of business metrics, and ultimately realize greater value from their investment in machine learning models and other models.
- Referring to FIGS. 1A-1B, dataflow diagrams are shown of a system 100 for evaluating the performance of a first model according to one embodiment of the present invention.
- Referring to FIG. 2, a flowchart is shown of a method 200 performed by the system 100 according to one embodiment of the present invention.
- The system 100 includes a first machine learning model 102a and a machine learning model identification module 104, which identifies the first machine learning model 102a (FIG. 2, operation 202).
- the first machine learning model 102a may be one of a plurality of machine learning models 102a-n in the system 100, where n may be any number.
- the term “machine learning model” is used herein for convenience to refer to the models 102a-n.
- any of the models 102a-n may be any kind of model, such as a machine learning model or a hand-written model.
- any reference herein to a “model” without a qualifier refers to any kind of model.
- the module 104 may identify any kind of model, whether that model is a machine learning model or a hand-written model.
- The first machine learning model 102a may, for example, be any type of machine learning model, such as any of the following: a neural network (such as a deep learning model); a decision tree; an ensemble model (such as a random forest); a large language model (LLM); a logistic regression model; a statistical regression model; a naive Bayes classifier; a support vector machine (SVM); a K-nearest neighbors (KNN) model; or a gradient boosting machine.
- the first machine learning model 102a may be a trained model or a hand-written model.
- The plurality of machine learning models 102a-n may include multiple machine learning models of the same type, multiple machine learning models of different types, multiple hand-written models of the same type, multiple hand-written models of different types, at least one machine learning model and at least one hand-written model, or any combination thereof.
- any two of the plurality of machine learning models 102a-n may be of the same type or of a different type.
- a “performance metric,” as that term is used herein, is a quantitative measure used to evaluate the effectiveness, correctness, and/or value of a model's predictions.
- a performance metric serves as a standardized way to assess how well a model performs its intended task, typically accomplished in part by comparing the model's predictions to known actual outcomes.
- Performance metrics may take a variety of forms, such as mathematical formulas, comparative measures, aggregation functions, threshold-based calculations, financial forecasts, and cost or value functions. Performance metrics may be broadly categorized into technical metrics (which report on the predictive performance of a model irrespective of the model’s intended usage) and business metrics (which relate to organizational goals by reporting on the value attained by using the model in terms such as profit and loss).
- The term “metric,” when used herein without a qualifier, is used synonymously with “performance metric.”
- the plurality of performance metric specifications 106a-m may include specifications of multiple performance metrics of the same type, specifications of multiple performance metrics of different types, or any combination thereof.
- any two of the plurality of performance metric specifications 106a-m may be specifications of performance metrics of the same type or of a different type.
- the first performance metric specification 106a may specify a first performance metric that has a first parameter 130a having a first value 132a.
- the value 132a affects the value of the first performance metric; the metric's calculation is based in part on the value 132a.
- the first performance metric specification 106a may include a data structure having the first parameter 130a, which may have the first value 132a. Because the first performance metric specification 106a specifies or otherwise represents the first performance metric, it should be understood that any reference herein to “the first performance metric specification” may refer to the first performance metric and vice versa.
- the first parameter 130a may be any of a variety of parameters, such as any of the following:
- Decision threshold: This is explicitly mentioned in the claim and is a crucial parameter for many metrics. It determines the score above which a model's prediction is considered “positive” and therefore determines the cases for which a certain business operation would or would not be conducted. For example, with a 90% confidence threshold, only scores of 90% or higher would be considered positive predictions, and certain business operations would or would not be conducted for cases with such scores. (A code sketch following this list illustrates how such parameters may enter a metric calculation.)
- the decision threshold could, for example, determine which transactions would be screened for fraud, which customers would be contacted for marketing, and which satellites would be investigated because their batteries are predicted to soon run out of energy.
- False positive cost: This parameter represents the cost, loss, or penalty associated with a false positive prediction. It's particularly relevant for business metrics like profit or cost calculations.
- True negative gain: This parameter represents the benefit or value associated with a correct negative prediction.
- Population size: This parameter can be used to scale metrics for forecasting performance over an arbitrarily large number of expected cases, rather than only calculating for the particular number of cases in the test set.
- Time window: For time-sensitive metrics, this parameter could specify the time frame over which the metric is calculated.
- Segmentation criteria: This parameter could be used to calculate metrics for specific subsets of the data based on certain characteristics.
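- As a hedged illustration of how parameters such as these might enter a business-metric calculation, the following TypeScript sketch computes a profit-style metric; the interface and function names are hypothetical, not prescribed by this disclosure:

```typescript
// Hypothetical parameter set for a profit-style business metric; the names
// are illustrative, not prescribed by this disclosure.
interface MetricParams {
  decisionThreshold: number; // scores at or above this count as positive predictions
  falsePositiveCost: number; // cost incurred per false positive
  truePositiveGain: number;  // value gained per true positive
  populationSize?: number;   // optional scaling to an expected deployment population
}

// Profit over a test set, optionally scaled from the test set size to the
// expected population size.
function profit(
  cases: { score: number; actual: 0 | 1 }[],
  p: MetricParams
): number {
  let tp = 0;
  let fp = 0;
  for (const c of cases) {
    if (c.score >= p.decisionThreshold) {
      if (c.actual === 1) tp++;
      else fp++;
    }
  }
  const raw = tp * p.truePositiveGain - fp * p.falsePositiveCost;
  return p.populationSize ? raw * (p.populationSize / cases.length) : raw;
}
```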
- the first performance metric may have any number of parameters, such as one or more parameters in addition to the first parameter 130a. Multiple parameters of the first performance metric 106a may differ from each other in any way. For example, parameters of the first performance metric 106a may include any two or more of the parameters in the list above.
- the first parameter 130a of the first performance metric 106a may have a first value 132a.
- the first value 132a may, for example, be an initial setting of the first parameter 130a, serving as a starting point for the evaluation process described herein. This allows the method 200 to begin with an initial default setting for comparison purposes.
- the first value 132a of the first parameter 130a may be user-defined (e.g., by the user 126), allowing for customization based on specific use cases or organizational needs.
- the first value 132a of the first parameter 130a may, for example, be a default value predetermined by the system 100.
- anything said herein about the first performance metric 106a is equally applicable to any performance metric. Anything said herein about the first parameter 130a of the first performance metric 106a is equally applicable to any parameter of the first performance metric 106a, and to any parameter of any performance metric. Anything said herein about the first value 132a of the first parameter 130a of the first performance metric 106a is equally applicable to any value of any parameter of any performance metric.
- the system 100 also includes test set data 110 associated with a test set.
- the system 100 includes performance metric computation module 114, which receives the test set data 110 (FIG. 2, operation 206).
- The purpose of using a test set is to evaluate the first machine learning model 102a’s performance on data it has not seen during training, providing an objective measure of its predictive capabilities.
- the test set data 110 includes a plurality of test case data elements 112 corresponding to a plurality of test cases in the test set.
- the plurality of test case data elements 112 include a plurality of score outputs and a plurality of dependent variable values.
- Each test case data element E in the plurality of test case data elements 112 corresponds to a corresponding test case C.
- Each test case data element E includes: (1) data representing a score output by the first machine learning model 102a when test case C is provided as an input to the first machine learning model 102a; and (2) data representing a value of a dependent variable for test case C.
- the test set data 110 may include both: (1) some or all of the variables of the training data that was used to train the first machine learning model 102a, such as in the form of a table; and (2) an additional variable containing, for each of the plurality of test cases represented by the test case data elements 112, the score produced by the first machine learning model 102a for that test case. It could also include (3) an additional variable not included within the training data but provided for use in the calculation of a metric, such as the magnitude of a transaction that may be fraudulent, since that magnitude affects the cost associated with a model failing to correctly identify transactions that are fraudulent. To assist in evaluating multiple models simultaneously, additional score columns may be included for each model. This allows for the incorporation of additional information in the test set data 110 that may be relevant to the evaluation process.
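- As an illustration only (this disclosure does not prescribe a concrete schema), test case data elements of this kind might be represented as follows; the field names are hypothetical:

```typescript
// Hypothetical shape of one test case data element E; field names are
// illustrative, not prescribed by this disclosure.
interface TestCaseDataElement {
  score: number;                   // score output by the model for test case C
  actual: 0 | 1;                   // value of the dependent variable for test case C
  extras?: Record<string, number>; // additional variables, e.g. transaction amount
}

// To evaluate several models at once, one score column per model may be carried.
interface MultiModelTestCase {
  scores: Record<string, number>;  // model identifier -> score
  actual: 0 | 1;
  extras?: Record<string, number>;
}
```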
- the purpose of this computation is to generate a set of performance measurements that represent how well the first machine learning model 102a performs across all test cases.
- the method 200 provides a detailed view of the first machine learning model 102a’s performance, allowing for both aggregate analysis and examination of performance variations across different subsets of the data.
- a plurality of metric values are computed, one for each test case, because each test case corresponds with one possible setting for the decision threshold, and each setting for the decision threshold affects the calculation of a metric.
- when the decision threshold is set at 0%, it corresponds with zero test cases falling below the threshold, which means that all test cases are treated as positive cases.
- when the decision threshold is set at 1/p, where p equals the number of test cases, all test cases but the first are treated as positive cases. This continues across the entire range of possible threshold settings, corresponding with the entire plurality of test cases.
- This range of metric values corresponding with the range of possible threshold settings can be depicted in a chart where the horizontal axis represents the threshold setting and the vertical axis represents the metric value.
- the test cases may first be sorted by their scores before computing the metric values. For example, after sorting by score, the profit metric may be calculated across the list of test cases, keeping a running total of false positives and true positives so that the performance metric computation module need only run through the test set one time.
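- A minimal sketch of this single-pass computation follows, assuming test cases sorted in descending order of score so that lowering the threshold admits one additional positive prediction per step; the function name is illustrative:

```typescript
// One pass over score-sorted test cases yields one profit value per possible
// decision threshold position, using running true/false positive totals.
function profitCurve(
  cases: { score: number; actual: 0 | 1 }[],
  tpGain: number,
  fpCost: number
): number[] {
  const sorted = [...cases].sort((a, b) => b.score - a.score); // descending by score
  const curve: number[] = [0]; // threshold above all scores: no positive predictions
  let tp = 0;
  let fp = 0;
  for (const c of sorted) {
    if (c.actual === 1) tp++; // this case is now treated as a positive prediction
    else fp++;
    curve.push(tp * tpGain - fp * fpCost); // metric value at this threshold position
  }
  return curve; // cases.length + 1 values, one per possible threshold setting
}
```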
- Lift charts: These display how much better the model performs compared to random selection across different confidence thresholds. Specifically, a lift chart displays the proportion of positive cases correctly identified by the model for each decision threshold setting, and shows for comparison the average random model, represented as a straight diagonal line from the bottom-left to the top-right.
- Dashboards: These combine multiple visual elements, such as graphs, metrics, and controls, into a single interface. Users can interact with various components to explore different aspects of the model's performance.
- Dashboards may include sliders or other on-screen controls for adjusting parameters like true positive profit, false positive loss, and population size. As users adjust these parameters, the affected visual elements update in real-time to reflect the changes.
- They may also include geographical selection tools, allowing users to evaluate model performance for specific regions or subsets of data.
- Heat maps or color-coded tables: These can represent performance across different subsets of data or different threshold values. For example, a heat map could show how the model performs across different customer segments or product categories. Color gradients can be used to quickly identify areas of strong or weak performance, making it easy to spot patterns or anomalies in the model's behavior.
- Histograms or distribution plots: These can show the distribution of performance metric values across a range of decision threshold settings, providing insights into the overall spread and central tendency of the model's performance. Overlaying multiple histograms can allow for easy comparison between different models or different subsets of the data.
- Comparative visualizations: These allow for side-by-side comparison of multiple models. For example, a single graph might show profit curves for three different models, each in a different color, allowing users to easily compare their performance. A table can more succinctly show a plurality of metrics for ease of visual comparison. The metrics in a table all update in real-time as the user alters any parameter settings.
- Explanatory visualizations: These include features like an “Explain” button that generates both visual and auditory explanations of the displayed metrics and graphs. They may use animations or step-by-step breakdowns to guide users through the interpretation of complex visualizations, making them accessible to non-technical stakeholders.
- a chatbot, such as one utilizing a large language model, can be included to allow the user to pose questions about the meaning of the charts and all visual elements, and to expound on the pragmatic tradeoffs between models and between parameter settings reflected by the visual elements. Such a chatbot allows interrogation by the user to guide decision-making as to which model to deploy and how to deploy it.
- the system 100 may include a parameter adjustment module 122, which may receive first parameter adjustment input 124 from the user 126 (FIG. 2, operation 212). As will be described in more detail below, such input 124 enables the user 126 to interactively explore and optimize the performance of the first machine learning model 102a.
- the system 100 and method 200 may receive the first parameter adjustment input 124 from the user 126 using any of a variety of user interface elements (e.g., graphical user interface (GUI) elements), such as any one or more of the following, in any combination (a brief wiring sketch follows this list): • Sliders, dials, or levers: The user 126 may move a slider, dial, or lever to adjust the values of parameters such as decision thresholds or cost values.
- Text input fields: The user 126 may directly enter specific values for parameters.
- Dropdown menus: The user 126 may select values from predefined options for categorical parameters.
- Interactive graphical elements: The user 126 may adjust parameter values by interacting with visual representations, such as moving a line on a graph.
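- In a web-based embodiment, such a control might be wired up roughly as follows; computeMetricValues and renderChart are hypothetical stand-ins for the computation and display modules described herein:

```typescript
// Hypothetical wiring of a threshold slider to real-time recomputation;
// these declared functions stand in for the modules of FIG. 1.
declare function computeMetricValues(decisionThreshold: number): number[];
declare function renderChart(values: number[]): void;

const slider = document.getElementById("threshold-slider") as HTMLInputElement;
slider.addEventListener("input", () => {
  const threshold = Number(slider.value) / 100; // e.g. a slider ranging over 0..100
  renderChart(computeMetricValues(threshold));  // update the visual output immediately
});
```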
- the first parameter adjustment input 124 may specify an adjustment to any of a variety of parameters. For example, if the first parameter 130a is a decision threshold, then the first parameter adjustment input 124 may specify an adjustment to the confidence threshold that determines when the first machine learning model 102a’s prediction is considered positive. For example, changing the threshold from 90% to 80% would classify more predictions as positive. Such an adjustment may assist in finding an optimal balance between, for example, true positives and false positives.
- the first parameter adjustment input 124 may specify a modification to the cost associated with false positive predictions. Such a modification may assist in reflecting a real-world change in the business impact of false positives in the performance metric.
- the first parameter adjustment input 124 may specify a modification to the value or benefit associated with true positive predictions. Such a modification may assist in more accurately representing, or updating when needed, the gain from correct positive predictions in the performance metric.
- the system 100 enables interactive exploration of the first machine learning model 102a’s performance under different scenarios. This interactivity supports better decision-making about model deployment and optimization, as users can immediately see the effects of their adjustments on the first machine learning model 102a’s performance metrics.
- the parameter adjustment module 122 may update the first parameter 130a of the first performance metric 106a to a second value 132b of the first parameter 130a of the first performance metric 106a, based on the first parameter adjustment input 124 (FIG. 2, operation 214).
- The second value 132b of the first parameter 130a of the first performance metric 106a may differ from the first value 132a of the first parameter 130a of the first performance metric 106a.
- Such updating may be performed in any of a variety of ways.
- the parameter adjustment module 122 may directly assign the value specified by the first parameter adjustment input 124 as the second value 132b of the first parameter 130a.
- the parameter adjustment module 122 may perform a calculation based on the value specified by the first parameter adjustment input 124, and assign the result of the calculation as the second value 132b of the first parameter 130a.
- the performance metric computation module 114 may compute, based on each of the plurality of test case data elements 112, a corresponding updated first performance metric value using the second value 132b of the first parameter 130a of the first performance metric 106a, thereby computing a plurality of updated first performance metric values 116b corresponding to the plurality of test case data elements 112 (FIG. 2, operation 216).
- the method 200 may recompute the values of the first performance metric 106a using the updated (second) parameter value 132b that was set in step 214.
- This step enables the system 100 and method 200 to provide feedback (e.g., in real-time) on how the parameter adjustment reflected in the first parameter adjustment input 124 affects the first machine learning model 102a’s performance evaluation.
- operation 216 may use the same test set data 110 that was used in the initial computation performed by the performance metric computation module 114 (step 208), but may apply the new (second) parameter value 132b to recalculate the values of the first performance metric 106a.
- the purpose of this recomputation is to show how the first machine learning model 102a's performance changes in response to the parameter adjustment that results from the first parameter adjustment input 124.
- the method 200 provides a comprehensive view of how the parameter change affects performance across all of the test cases.
- the method of recomputation will depend on the specific performance metric being used and how it incorporates the adjusted parameter value. For example, if the first parameter 130a is a decision threshold, the computation might involve re-evaluating each model prediction against this new threshold and recalculating the metric accordingly. For a metric like profit, the calculation might use the new values for parameters such as true positive profit or false positive loss to recompute the profit for each test case.
- the result of operation 216 is a new set of updated performance metric values 116b that correspond one-to-one with the test cases represented by the plurality of test case data elements 112. This allows for a direct comparison between the original 116a and updated 116b metric values, enabling users to see the immediate impact of their parameter adjustment on the first machine learning model 102a’s evaluated performance.
- this step enables dynamic, interactive exploration of the first machine learning model 102a’s performance under different scenarios. This supports better decision-making about model deployment and optimization, as users can immediately see how their parameter adjustments affect the first machine learning model 102a’ s performance metrics.
- the visual output module 118 may display visual output 120b representing the plurality of updated first performance metric values 116b (FIG. 2, operation 218).
- the updated visual output 120b may differ in any of a variety of ways from the previous visual output 120a.
- This step may include providing immediate, visual feedback on how the change in the value of the first parameter 130a from the first value 132a to the second value 132b affects the first machine learning model 102a’s performance evaluation.
- the visual output can take various forms, similar to those described in step 210, but now reflecting the updated performance metric values 116b.
- The method 200 may perform operations 216 and 218 in real-time, e.g., with little or no delay that is perceptible to the user 126 in response to providing the first parameter adjustment input 124.
- the user 126 can instantly see the impact of their parameter adjustments on the first machine learning model 102a’s performance, enabling rapid iteration and exploration of different scenarios.
- Such visual representations, especially when updated repeatedly in real-time in response to parameter adjustment input, make it easier for users, especially non-technical stakeholders, to grasp complex relationships between parameters and performance metrics.
- Such real-time updates allow users to quickly understand the impact of changes to parameter settings by observing how changes affect the visual representations of performance. Users can dynamically explore the model's behavior under different conditions, leading to more comprehensive analysis and better-informed decisions. Fast, responsive updates create a seamless interaction, encouraging users to explore more thoroughly and gain deeper insights.
- certain embodiments of the present invention may perform various actions in “real-time.”
- for example, as the user 126 adjusts the value of the first parameter 130a, such as by moving a slider for the true positive profit or false positive loss, the system 100 may immediately generate and display the corresponding visual output 120b.
- The system 100 may provide immediate or near-immediate visual feedback as the user 126 interacts with graphical elements. For instance, when the user 126 drags a slider representing the true positive profit or false positive loss on a graph, the system 100 may update the visual output representing the corresponding metrics continuously during the drag operation, thereby causing the entire shape of the profit curve to change in real-time.
- the system 100 may generate and display the visual output 120b within no more than a particular amount of time (e.g., 16ms or 32ms) after the system 100 receives the first parameter adjustment input 124. This ensures a smooth, responsive feel to user interactions.
- the system 100 may complete the redraw (e.g., the generation of the visual output 120b) within 32 milliseconds of receiving the parameter adjustment input 124. More generally, the system 100 may complete all visual updates, including complex recalculations and redraws, within 0.05 seconds of receiving the parameter adjustment input 124, thereby ensuring that users perceive the updates as occurring in real-time.
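- One common way to meet such a frame budget (roughly 16ms per frame at 60 Hz, 32ms at 30 Hz) is to coalesce rapid input events into at most one recomputation and redraw per animation frame. The following sketch illustrates the idea and is not prescribed by this disclosure:

```typescript
// Coalesce bursts of parameter-adjustment events into at most one redraw per
// animation frame, keeping updates within a ~16ms/32ms frame budget.
declare function recomputeAndRedraw(parameterValue: number): void;

let pending: number | null = null;
function onParameterInput(parameterValue: number): void {
  const mustSchedule = pending === null; // schedule only once per frame
  pending = parameterValue;              // keep only the latest value
  if (mustSchedule) {
    requestAnimationFrame(() => {
      recomputeAndRedraw(pending as number);
      pending = null;
    });
  }
}
```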
- Embodiments of the present invention are particularly well-suited for evaluating models that drive binary decisions, such as whether to audit a transaction, contact a customer, or approve a loan application.
- any one or more of the plurality of machine learning models 102a-n may be binary models and/or models that drive binary decisions.
- in some embodiments, all of the plurality of machine learning models 102a-n are binary models.
- a binary model is a type of model that predicts outcomes that fall into two distinct categories or classes. These models output a predictive score that reflects the expected chance that the outcome will be one of two possible outcomes.
- the pair of possible outcomes would typically be represented as a pair such as “yes” or “no”, “1” or “0”, or “true” or “false”.
- a binary model’s predictive score could be a probability, reflecting the chances for each case that the outcome will be “true” (rather than “false”).
- Binary models are commonly used in applications such as fraud detection (fraudulent vs. legitimate transactions), customer churn prediction (will a customer leave or stay), medical diagnosis (presence or absence of a condition), and spam detection (spam vs. not spam).
- The output of a binary model is often a probability or score representing the likelihood of an instance belonging to one of the two classes. A decision threshold is then applied to this score to make the final binary classification.
- binary models directly align with binary decisions
- both binary and non-binary models may also be used to drive binary decisions, such as by applying thresholds to their outputs.
- a model predicting customer spending amounts can be used to make a binary decision by setting a threshold: contact customers predicted to spend above a certain amount.
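- As a one-line illustration of such thresholding (the $500 cutoff being arbitrary):

```typescript
// A regression model's predicted spending amount drives a binary contact
// decision; the 500 cutoff is an arbitrary, illustrative threshold.
const contactCustomer = (predictedSpend: number): boolean => predictedSpend > 500;
```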
- the first performance metric specified by the first performance metric specification 106a may include a second parameter (not shown). More generally, the first performance metric may include any number of parameters. Any of these parameters (e.g., the second parameter) may, for example, be any of the following: a confidence threshold, a parameter of the confidence threshold, a false positive cost, a parameter of the false positive cost, a false negative cost, a parameter of the false negative cost, a true positive gain, and a parameter of the true positive gain.
- a parameter of a parameter refers to a variable that modifies or influences another parameter used in calculating a performance metric.
- Parameters of parameters allow for more nuanced and flexible evaluation of models by introducing additional layers of customization and adjustment.
- the primary parameter may be the transaction amount, which varies for each case in the test set.
- a secondary parameter (a parameter of the transaction amount parameter) may be a “transaction size fraud loss factor” that determines what percentage of the transaction amount is actually lost due to fraud.
- a parameter of a parameter enables users to adjust not only the base parameter, but also how that parameter is calculated or applied, allowing for more precise and nuanced model evaluation.
- the test set data 110 may exist before any of the methods disclosed herein are performed, and may have been generated in any way. Some embodiments of the present invention may, however, generate the test set data 110. Generating the test set data 110 may include, for each of the plurality of test case data elements 112 (in the test set data 110) corresponding to a corresponding test case C: (1) providing the test case C as an input to the first machine learning model 102a; and (2) using the first machine learning model 102a to output a score based on the test case C. The same is equally applicable to any of the plurality of machine learning models 102a-n.
- The test set data 110 may take any of a variety of forms. It may be helpful, however, for the plurality of test case data elements 112 to be sorted by their score outputs within the test set data 110. To achieve this, the method 200 may include (e.g., in operation 208): (1) sorting the plurality of test case data elements 112 by their score outputs to produce a sorted plurality of test case data elements; and (2) for each test case data element E in the sorted plurality of test case data elements, corresponding to a corresponding test case C, in sorted order: computing the first performance metric value corresponding to test case data element E based on the value of the dependent variable for test case C. Sorting the plurality of test case data elements by their score outputs may include sorting them in ascending order or in descending order.
- Such sorting has a variety of benefits. For example, sorting the test case data elements 112 by their score outputs allows for more efficient computation of performance metrics. This is particularly useful when calculating metrics that depend on the ranking of predictions, such as lift or cumulative gains. Furthermore, by having the plurality of test case data elements 112 sorted, it becomes easier to analyze the effect of different decision thresholds on model performance. This is valuable for optimizing the first machine learning model 102a's operational settings and understanding trade-offs between different performance metrics (e.g., false positive rate vs. profit). Sorted data elements also enable the creation of more informative visualizations, such as lift curves or cumulative gain charts, which are valuable for evaluating model performance and communicating results to stakeholders. Furthermore, computing performance metrics for each test case data element in sorted order allows for efficient incremental calculations. This is particularly beneficial for metrics that accumulate values across the dataset, such as cumulative profit or cost.
- When evaluating multiple models simultaneously, having sorted test case data elements for each model simplifies the process of comparing their performance. Sorted data elements also enable rapid recalculation of performance metrics when users interactively adjust decision thresholds, enhancing the system 100's responsiveness and user experience.
- The first performance metric specification 106a may further specify one or more additional variables, and each of the plurality of test case data elements 112 may further include a corresponding value of the additional variable(s).
- Computing the plurality of first performance metric values 116a may include: computing, for each test case data element E in the plurality of test case data elements 112, the corresponding first performance metric value based on the specification of the first performance metric 106a and test case data element E's score output, dependent variable value, and additional variable value(s).
- the system 100 can provide a more comprehensive and nuanced evaluation of model performance. This is particularly useful for scenarios where the impact of a prediction varies based on factors beyond just the dependent variable.
- the use of an additional variable enables more accurate representation of real-world scenarios because, in many applications, the impact of a model's prediction can vary significantly based on additional factors. For example, in fraud detection, the cost of a false negative (failing to detect fraud) is directly related to the transaction amount: a $5,000 fraudulent transaction has a much higher impact than a $500 fraudulent transaction.
- the performance metric can more accurately reflect the real-world consequences of model decisions.
- operation 208 may include binning the test set data 110 and using intermediate values to efficiently calculate performance metrics.
- the system 100 may associate the test set data 110 with a plurality of bins, where each bin contains a distinct subset of test case data elements. This reduces the granularity of the data, allowing for more efficient processing.
- a corresponding plurality of intermediate values of the first performance metric may be computed. These intermediate values may be designed to contain all the necessary information to calculate the performance metric value for that bin, given a change to any parameter, without needing to process individual test cases within the bin.
- The corresponding first performance metric value for each bin may be computed based on the bin's plurality of intermediate values and all parameter values. This significantly reduces computational time, especially when dealing with large datasets or when recalculating metrics due to parameter changes.
- the binning process decreases the resolution of the decision threshold settings. For example, with 10,000 test cases binned into 1,000 bins of 10 each, the threshold can only be placed in 1,001 positions instead of 10,001. This trade-off between precision and computational efficiency is generally acceptable, since it enables real-time response to parameter changes, and it can be switched off if a more granular decision threshold setting is needed.
- the binning techniques described above may be applied to different types of performance metrics.
- the intermediate values could be the number of false positives and false negatives within each bin, allowing for quick calculation of an aggregate cost metric. This approach enhances the overall efficiency and responsiveness of the model evaluation process, particularly when dealing with large datasets or when providing interactive visualizations that require real-time updates based on user input.
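- A hedged sketch of one possible realization of this binning-and-intermediate-values technique follows, assuming per-bin counts of actual positives and negatives as the intermediate values and bins sorted in descending order of score; all names are illustrative:

```typescript
// Per-bin intermediate values: enough to recompute an aggregate cost metric
// under any parameter values without revisiting individual test cases.
interface BinSummary {
  positives: number; // actual positives in the bin
  negatives: number; // actual negatives in the bin
}

// One-time pass: group test cases (already sorted descending by score) into
// bins and summarize each bin.
function summarizeBins(
  sortedCases: { actual: 0 | 1 }[],
  binCount: number
): BinSummary[] {
  const size = Math.ceil(sortedCases.length / binCount);
  const bins: BinSummary[] = [];
  for (let i = 0; i < sortedCases.length; i += size) {
    const bin: BinSummary = { positives: 0, negatives: 0 };
    for (const c of sortedCases.slice(i, i + size)) {
      if (c.actual === 1) bin.positives++;
      else bin.negatives++;
    }
    bins.push(bin);
  }
  return bins;
}

// Fast path: recompute a cost curve from the bin summaries alone. With the
// threshold at a bin boundary, bins above it are predicted positive (their
// negatives are false positives) and bins below it are predicted negative
// (their positives are false negatives).
function costCurve(bins: BinSummary[], fpCost: number, fnCost: number): number[] {
  const totalPositives = bins.reduce((sum, b) => sum + b.positives, 0);
  const curve: number[] = [totalPositives * fnCost]; // threshold above all bins
  let tp = 0;
  let fp = 0;
  for (const b of bins) {
    tp += b.positives;
    fp += b.negatives;
    curve.push(fp * fpCost + (totalPositives - tp) * fnCost);
  }
  return curve; // one cost value per possible bin-boundary threshold position
}
```

- Because a function like costCurve touches only the bin summaries (e.g., 1,000 of them) rather than every individual test case, it can be re-run on every slider movement, consistent with the real-time responsiveness described herein.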
- the method 200 may include computing, based on each of the plurality of test case data elements, a corresponding updated first performance metric value using the second value of the first parameter of the first performance metric, thereby computing a plurality of updated first performance metric values corresponding to the plurality of test case data elements.
- Such computing of the plurality of updated first performance metric values may include, for each of the plurality of bins, computing the corresponding updated first performance metric value for that bin based on the bin's plurality of intermediate values of the first performance metric, thereby forgoing the need for a lengthy processing of the complete test case data to achieve this computation.
- the system 100 can quickly recalculate the performance metric values when parameters are changed, without needing to process individual test cases again.
- This efficient recalculation enables real-time updates of performance metrics in response to user input, facilitating interactive exploration of model perfonnance under different scenarios.
- This method allows for efficient handling of large datasets by reducing the computational requirements of updating performance metrics, making it feasible to work with extensive test sets. Users can easily adjust parameters and immediately see the impact on performance metrics, and how those metrics vary across a plurality of decision threshold values, supporting more comprehensive model evaluation and optimization.
- this approach can significantly increase speed, especially for large datasets.
- the techniques disclosed herein may be implemented in any of a variety of ways.
- such techniques may be implemented using any of a variety of programming languages, such as JavaScript and/or TypeScript.
- Such techniques may be implemented using a software-as-a-service (SaaS) model, in which some computations are executed on a server and the results of those computations are provided over a network to a client, which performs functions such as displaying output based on the computation results.
- Embodiments, including SaaS embodiments, can also perform all or some computations on the client side, in order to improve responsiveness as visual output changes in response to user parameter adjustments. More generally, the techniques disclosed herein may be performed using server-side processing, client-side processing, or any combination thereof.
- Embodiments of the present invention may permit users (e.g., the user 126) to create customized model evaluation capabilities, called widgets, which can be integrated into the system 100, re-used by the creating user, and optionally shared with other users.
- This user-definable widget feature within embodiments of the present invention addresses the “long tail” of model evaluation capabilities; while embodiments of the present invention provide a core suite of model evaluation capabilities, the widget feature enables users to create additional, specialized evaluation tools that cater to specific use cases or industries not covered by the standard offerings.
- embodiments of the present invention become a scalable, general-purpose model evaluation platform. This approach significantly expands the range of evaluation metrics and visualizations available to users, ensuring that the system 100 can adapt to diverse and evolving needs across various industries and machine learning applications.
- the user-definable widget feature may also foster a collaborative ecosystem in which users can contribute their widgets to an expanding library, making them accessible to other users and organizations.
- This aspect of embodiments of the present invention draws parallels to app store models, such as the Salesforce AppExchange marketplace, in which users can both create and benefit from a growing collection of specialized tools.
- the user 126 may provide input that can take various forms, such as code written in a programming or scripting language.
- the system allows flexibility in the implementation language, with options like JavaScript for web-based implementations, although any suitable programming language may be used.
- One example of a custom widget that the system 100 may enable the user 126 to define is a custom performance metric that calculates model performance based on one or more user-specified parameters and variables. Once the user 126 has defined such a custom widget that specifies a custom performance metric, the system 100 may use the custom performance metric in any of the ways disclosed herein in connection with performance metrics, such as in any of the ways disclosed herein in connection with the first performance metric 106a.
- the system 100 may enable the user 126 to incorporate various parameters and variables into a custom widget, such as any one or more of the following, in any combination: one or more independent variables of the test cases 110, one or more dependent variables of the test cases 110, and a confidence threshold.
- the user 126 may create a widget that calculates and displays a profit curve with interactive sliders for adjusting parameters like true positive profit and false positive loss.
- Another example is a custom widget that calculates the aggregate cost of a fraud detection model, with an additional slider for varying the transaction magnitude threshold.
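- As an illustrative sketch only, a custom metric widget of this kind might conform to an interface along these lines; the interface and names are hypothetical, not a disclosed API:

```typescript
// Hypothetical contract for a user-defined metric widget; this interface is
// illustrative, not a disclosed API.
interface MetricWidget {
  name: string;
  // Adjustable parameters exposed to the user as sliders or input fields.
  parameters: { id: string; label: string; defaultValue: number }[];
  // Compute one metric value per threshold position over score-sorted cases.
  compute(
    cases: { score: number; actual: 0 | 1; extras?: Record<string, number> }[],
    params: Record<string, number>
  ): number[];
}

// Illustrative widget: running aggregate false-positive cost of a fraud model,
// counting only transactions at or above a magnitude threshold.
const fraudCostWidget: MetricWidget = {
  name: "Fraud aggregate cost",
  parameters: [
    { id: "fpCost", label: "False positive cost", defaultValue: 50 },
    { id: "minAmount", label: "Transaction magnitude threshold", defaultValue: 100 },
  ],
  compute(cases, params) {
    const sorted = [...cases].sort((a, b) => b.score - a.score);
    let cost = 0;
    return sorted.map((c) => {
      const amount = c.extras?.amount ?? 0;
      // Each case flagged positive at this threshold position contributes a
      // false positive cost if it is actually negative and large enough.
      if (c.actual === 0 && amount >= params.minAmount) cost += params.fpCost;
      return cost;
    });
  },
};
```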
- a user 126 may position multiple widgets on the screen simultaneously in order to provide the user visual access to a greater number of interactive capabilities and visual elements at the same time. This positioning is akin to a dashboard, but differs in that the word “dashboard” in industry typically refers to retrospective counts across various subsegments (“slicing and dicing”), whereas a widget created within an embodiment of this system serves an entirely different purpose: evaluating the performance of a predictive model.
- an embodiment of the system 100 or method 200 may be used by a user as follows. The following is intended to be an overview that covers some valuable, but not all, potential uses of embodiments of the present invention, such as the system 100 and method 200.
- the data scientist provides, to the system 100, the test set data 110 (including the test case data elements 112) that encodes the performance of the model(s) on test cases.
- an exterior model-training system may provide this data to the system 100 via an API.
- The data scientist may also provide performance metric specifications 106a-m to the system 100. This may include, for example, metrics such as savings, false positive rate, and a gains curve. For each performance metric specification, the data scientist may provide the associated parameters, such as false positive cost and false negative cost.
- the data scientist and other users may then use the system 100 to display and view the visual output 120a.
- This provides a view of the models' performance in various ways, according to multiple organizational objectives, including increasing business metrics like savings and decreasing the false positive rate.
- a business metric such as savings, is based on business factors subject to change and uncertainty.
- The data scientist accounts for these factors by representing them as parameters within the performance metric specifications 106a-m.
- the system 100 allows each parameter to be altered by the user through a graphical user interface element such as a slider.
- The above steps taken by the data scientist serve to configure the system 100 for a given model development project. Once configured, the interface is user-friendly, understandable, and relevant for any user, including non-data-scientist business professionals, stakeholders, and decision makers.
- The model assessment steps described above and elsewhere herein may be taken at various phases of the model development process, such as during the repeated train-and-test development cycle, when deciding whether to authorize deployment, after deployment for ongoing performance monitoring, and when evaluating a refreshed candidate replacement model.
- certain embodiments of the present invention include efficient data processing techniques that can process large datasets, such as to enable real-time computations and visual updates for datasets that would be impractical or impossible to process manually or with conventional methods.
- embodiments of the present invention may apply any of the techniques disclosed herein to datasets containing large numbers of individual test cases, such as datasets including at least 100,000 test cases, at least 1 million test cases, at least 10 million test cases, at least 100 million test cases, or at least 1 billion test cases.
- Embodiments of the present invention may be used in connection with datasets in which each of the test cases includes at least 10 independent variables, at least 100 independent variables, at least 1000 independent variables, at least 1 million independent variables, or at least 1 billion independent variables.
- embodiments of the present invention may be used in connection with datasets containing: at least 1 million test cases, each with at least 100 independent variables and at least one dependent variable; at least 10 million test cases, each with at least 50 independent variables and at least one dependent variable; at least 100 million test cases, each with at least 20 independent variables and at least one dependent variable.
- embodiments of the present invention may efficiently process such large datasets in real-time. For instance, with 1 million test cases binned into 1,000 bins, after calculating intermediate values for each bin, embodiments of the present invention may perform any one or more of the following computations within no more than 10 milliseconds: compute final metric values for all 1,000 bins based on the intermediate values; update the visual output (e.g., a profit curve or gains curve) with 1,000 data points.
- These computations may be performed and the visual output updated within 50 milliseconds of receiving a parameter adjustment input from the user, thereby providing a smooth, interactive user experience, even when the visual output is based on computations that involve the entirety of a dataset containing millions of test cases.
- embodiments of the present invention extend to even larger datasets. For example, when evaluating a model across 100 million test cases, embodiments of the present invention may still provide responsive visual updates by intelligently sampling or binning the data using the techniques disclosed herein. This allows for interactive exploration of model performance on datasets that would be completely infeasible to analyze manually or with traditional methods.
- Any of the functions disclosed herein may be implemented using means for performing those functions. Such means include, but are not limited to, any of the components disclosed herein, such as the computer-related components described below.
- Embodiments of the present invention include features which are only possible and/or feasible to implement with the use of one or more computers, computer processors, and/or other elements of a computer system. Such features are either impossible or impractical to implement mentally and/or manually.
- embodiments of the present invention provide a significant technical improvement to computer-based model evaluation systems by enabling more efficient and flexible analysis of model performance, particularly through an innovative binning approach and use of intermediate values. The binning approach allows the system to process and analyze large volumes of test data more efficiently. By grouping test cases into bins, the system reduces the computational complexity of performance metric calculations, enabling it to handle datasets that would be impractical to process individually.
- Embodiments of the present invention solve the technical problem of providing real-time, interactive model evaluation for large datasets and complex metrics, which is particularly relevant in the context of modern machine learning applications that often involve massive amounts of data and sophisticated performance criteria.
- the invention employs a binning approach that allows for processing and analyzing large volumes of test data more efficiently.
- the system reduces the computational complexity of performance metric calculations, enabling it to handle datasets that would be impractical to process individually.
- The use of intermediate values for each bin allows for rapid recalculation of performance metrics when parameters are changed. This enables real-time updates of performance metrics and visualizations, allowing users to explore model performance under different scenarios and parameter settings without experiencing significant delays.
- the intermediate value approach allows for the efficient calculation of complex performance metrics that may require multiple components or steps. This enables more sophisticated model evaluation without sacrificing computational efficiency, addressing the need for evaluating models using sophisticated performance criteria. This approach scales well with increasing dataset sizes. As the number of test cases grows, the binning method continues to provide efficient performance, allowing the system to maintain responsiveness even with very large datasets. This is valuable for modern machine learning applications that often deal with massive amounts of data.
- Embodiments of the present invention are necessarily rooted in computer technology. This machine-centric approach distinguishes embodiments of the present invention from mere mental processes or abstract ideas. For example, embodiments of the present invention are specifically designed to evaluate and analyze the performance of complex machine learning models, including neural networks, large language models, decision trees, and ensemble models. These models are inherently computational and cannot be meaningfully evaluated without computer technology. The innovative binning approach and use of intermediate values enable efficient handling of large datasets that would be impractical or impossible to process manually. The ability of embodiments of the present invention to provide real-time updates of performance metrics and visualizations in response to user input relies on computational speed and efficiency that can only be achieved through computer technology.
- any claims herein which affirmatively require a computer, a processor, a memory, or similar computer-related elements, are intended to require such elements, and should not be interpreted as if such elements are not present in or required by such claims. Such claims are not intended, and should not be interpreted, to cover methods and/or systems which lack the recited computer-related elements.
- any method claim herein which recites that the claimed method is performed by a computer, a processor, a memory, and/or similar computer-related element is intended to, and should only be interpreted to, encompass methods which are performed by the recited computer-related element(s).
- any product claim herein which recites that the claimed product includes a computer, a processor, a memory, and/or similar computer-related element is intended to, and should only be interpreted to, encompass products which include the recited computer-related element(s).
- Such a product claim should not be interpreted, for example, to encompass a product that does not include the recited computer-related element(s).
- Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language.
- the programming language may, for example, be a compiled or interpreted programming language.
- Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor.
- Method steps of the invention may be performed by one or more computer processors executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output.
- Suitable processors include, by way of example, both general and special purpose microprocessors.
- the processor receives (reads) instructions and data from a memory (such as a read-only memory and/or a random access memory) and writes (stores) instructions and data to the memory.
- Storage devices suitable for tangibly embodying computer program instructions and data include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays).
- a computer can generally also receive (read) programs and data from, and write (store) programs and data to, a non-transitory computer-readable storage medium such as an internal disk (not shown) or a removable disk.
- Any data disclosed herein may be implemented, for example, in one or more data structures tangibly stored on a non-transitory computer-readable medium. Embodiments of the invention may store such data in such data structure(s) and read such data from such data structure(s).
- Any step or act disclosed herein as being performed, or capable of being performed, by a computer or other machine, may be performed automatically by a computer or other machine, whether or not explicitly disclosed as such herein.
- a step or act that is performed automatically is performed solely by a computer or other machine, without human intervention.
- a step or act that is performed automatically may, for example, operate solely on inputs received from a computer or other machine, and not from a human.
- a step or act that is performed automatically may, for example, be initiated by a signal received from a computer or other machine, and not from a human.
- a step or act that is performed automatically may, for example, provide output to a computer or other machine, and not to a human.
- The terms “A or B,” “at least one of A or/and B,” “at least one of A and B,” “at least one of A or B,” or “one or more of A or/and B” used in the various embodiments of the present disclosure include any and all combinations of words enumerated with it.
- “A or B,” “at least one of A and B” or “at least one of A or B” may mean: (1) including at least one A, (2) including at least one B, (3) including either A or B, or (4) including both at least one A and at least one B.
- embodiments of the present invention may include methods which produce outputs that are not optimal, or which are not known to be optimal, but which nevertheless are useful.
- embodiments of the present invention may produce an output which approximates an optimal solution, within some degree of error.
- terms herein such as “optimize” and “optimal” should be understood to refer not only to processes which produce optimal outputs, but also processes which produce outputs that approximate an optimal solution, within some degree of error.
Abstract
A universal system and method for dynamically evaluating and visualizing the performance of any predictive model, including machine learning models. The system and method compute performance metrics based on test set data and display visual representations in real-time, allowing users to interactively explore model performance by adjusting parameters that reflect model-deployment scenarios. Key features include model-agnostic design, support for both technical and business metrics, and the ability to compare multiple models. The system and method's extensible architecture enables custom metrics and visualizations, making them scalable across various modeling use cases and industries. By providing intuitive, real-time visual feedback, embodiments of the invention empower both technical and non-technical stakeholders to gain deeper insights into model behavior, leading to more informed decisions about deployment and optimization.
Description
Dynamic Visualization for the Performance Evaluation of Machine
Learning Models and Any Other Type of Predictive Model
BACKGROUND
[0001] Machine learning, also known as modeling, predictive modeling, predictive analytics, or predictive artificial intelligence (Al), is a field within artificial intelligence and data science. Organizations across various sectors, including finance, healthcare, and retail, use predictive models to automate or support operational decisions based on historical data.
[0002] Predictive models (also known as “models”) generate scores (also known as “predictive scores” or “model scores”) for individuals (e.g., customers, patients, transactions) to estimate the likelihood of a specific outcome, behavior, or category. These scores can pertain to a wide range of predictions, such as customer behavior, medical diagnoses, or fraud detection. Scores are typically formulated as probabilities, but do not need to be.
[0003] A model’s scores can pertain to, for example, whether the individual will click, buy, lie, die, cancel as a customer, fail to pay a debt, submit high claims as an insurance policyholder, exhibit a certain health outcome as a healthcare patient, or belongs to a category — such as spam (versus not spam), holding a certain medical diagnosis (versus not holding the diagnosis), depicting an image of a certain object such as a traffic light or a type of animal (versus not depicting that object), or being a fraudulent transaction (versus being an authorized transaction). In the case of models that score whether an individual belongs to a category, the model is sometimes said to "detect", although the word "predict" also still pertains and is commonly used for such projects, e.g., "the model predicts whether each transaction is fraudulent." In all cases, the model is placing odds on an unknown, whether that unknown is a future event or outcome or a currently-held category membership.
[0004] Evaluating the performance of predictive models is crucial for ensuring their reliability and value, and for informing and directing their development. However, existing methods for evaluating model performance often face significant challenges, such as:
• Lack of comprehensibility: Current visualization techniques may not effectively communicate a model’s strengths, weaknesses, and value to users or stakeholders, particularly those without technical expertise.
• Limited interpretability: The complexity of modern machine learning models can make it difficult for stakeholders to understand how model predictions are derived and how they could improve business outcomes.
• Inadequate performance metrics: Many evaluation methods focus on technical metrics that generally do not directly translate to or reveal business value or organizational goals.
• Static visualizations: Existing tools often provide static representations of model performance, limiting users’ ability to explore different scenarios or parameter settings.
[0005] These limitations create a need for more effective, relevant, and user-friendly methods to evaluate and visualize the performance and value of predictive models.
[0006] Organizations use predictive models to improve operational decisions across various domains. This process, known as model deployment, involves using model-generated scores to automate or support decisions in areas such as marketing, fraud investigation, credit approvals, and healthcare diagnostics. Model deployment aims to enhance organizational operations through various machine learning use cases, each defined by a prediction goal and a deployment plan. These use cases span a wide range of applications, including marketing (e.g., response modeling for targeted campaigns), customer retention (e.g., churn modeling), risk management (e.g., fraud detection, credit scoring, insurance selection), human resources (e.g., workforce hiring and retention), safety and maintenance (e.g., risk modeling for fire prevention, health code violations, and system failures), manufacturing (e.g., fault detection and reliability modeling), information security (e.g., spam filtering, network intrusion prevention), law enforcement (e.g., predictive policing), healthcare (e.g., diagnosis, prescription, and treatment adherence prediction), and education (e.g., predicting student success and targeting interventions).
[0007] Predictive models utilize input factors, known as independent variables, to calculate scores for individuals. These models can employ various calculation methods, such as “if-then” rules or mathematical equations, to process the independent variables and generate scores.
[0008] A model is often developed using machine learning, also known as predictive analytics, through a process known as “model training” or merely “training.” Training involves processing training data to create a model that can calculate scores based on independent variables. Various methods to do this — also known as machine learning methods or machine learning algorithms — exist, including statistical regression, decision trees, neural networks, and ensemble models. Large language models (LLMs) can also generate predictive scores, e.g., by prompting the LLM with the same question multiple times and using the proportion of positive responses as the predictive score; therefore, an LLM can serve and qualify as a predictive model. Alternatively, predictive models may be manually developed by human engineers or other human personnel, resulting in “hand-written” models. These models can be used and evaluated in the same way in which those developed through machine learning are used and evaluated.
[0009] After a model is trained, it must be evaluated to determine its suitability for deployment. Evaluation is performed using a test set, which contains example cases with known outcomes
(dependent variables or “actuals”). The values of the dependent variable could originate from manually labeling the data or could originate from historical outcomes, such as whether a customer made a purchase. The test set is different from the training data so that it can serve the purpose of objectively assessing the model's performance on new cases. Model evaluation involves quantitatively measuring predictive performance by calculating performance metrics. This process is also known as model testing, validation, or benchmarking. It applies to both machine learning-developed and hand-written models.
[0010] Some performance metrics depend on a confidence threshold (also known as decision threshold), which determines whether a model’s prediction is considered “positive” or “negative,” and therefore determines the action taken for the case. For example, with a 90% confidence threshold, only scores of 90% or higher are considered "positive" predictions. When predicting purchases, this could mean that individual customers considered as “positive” (because they are above that threshold) are contacted with marketing; when predicting fraud, this could mean that such individual transactions are blocked. This threshold must be set to a specific value in order to calculate certain metrics, such as accuracy (the proportion of correct predictions) and profit. For profit calculations in marketing campaigns, the threshold determines which individuals are targeted. The profit is then calculated by subtracting the cost of false positives (marketing to non-responders) from the gain of true positives (successful marketing to responders). Similarly, aggregate cost (or loss) is calculated using false positive and false negative costs, based on the confidence threshold. The values of metrics such as accuracy, profit, and others are always relative to a specific confidence threshold.
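As a minimal worked example of such a threshold-dependent profit calculation (all counts and dollar figures below are hypothetical, chosen only to illustrate the arithmetic):

```python
# Hypothetical marketing example: profit at a single confidence threshold.
threshold = 0.90

# Among customers scored at or above the threshold (the "positive" cases):
true_positives = 120    # responders who were marketed to
false_positives = 380   # non-responders who were marketed to

gain_per_responder = 40.0   # gain from each successful contact (assumed)
cost_per_contact = 2.0      # cost of marketing to a non-responder (assumed)

profit = true_positives * gain_per_responder - false_positives * cost_per_contact
print(profit)  # 120 * 40 - 380 * 2 = 4040.0
```

Moving the threshold changes which cases count as positive, and therefore changes both counts and the resulting profit.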
[0011] Organizations typically use technical metrics rather than business metrics to evaluate model performance. Technical metrics measure a model’s relative performance, in comparison to a baseline method such as random guessing, while business metrics assess a model’s value in terms of organizational goals or key performance indicators. Technical metrics report on the pure predictive performance of a model, with no aspect of the potential operational use of the model involved in their calculation. In contrast, business metrics must involve business factors and must relate to the potential operational use of the model, since they estimate the business value that would be gained by way of that operational use.
[0012] Examples of technical metrics include accuracy, area under the ROC curve (AUC), f-score, precision, recall, and various rates (false positive, false negative, true positive, true negative). Examples of business metrics include profit, revenue gain, savings, costs, customer attrition, marketing response rate, return on investment, and other operational efficiency measures. Despite their importance, business metrics are often underutilized in the machine learning industry. Model-training software tools and other machine learning software typically overlook these metrics, forcing data scientists who want to use them to implement them manually. Moreover, data science books and
training programs, including educational degree programs, typically overlook business metrics as well.
[0013] Organizations face three main challenges in evaluating predictive models. First, organizations require business metrics, rather than technical metrics, to determine if a model’s potential operational improvements align with organizational objectives. Second, organizations must decide not only whether to deploy a model, but how to deploy it to maximize business value. This includes determining the value for the confidence threshold for triggering operational decisions. Third, non-data science experts, such as managers and decision-makers, need to understand deployment parameters (like confidence thresholds) and evaluation results in terms that are relevant to organizational goals.
[0014] Model evaluation is critical for the successful deployment of machine learning projects. Yet limitations in current techniques for performing model evaluation lead to a variety of problems, including project failures and cancellations. Without evaluating in terms of business metrics, the potential value of a model in the context of its planned deployment is not assessed. In other words, the model has not been stress-tested in its intended usage - and, if an organization is not measuring business value, it is not pursuing business value. Although effective model evaluation requires metrics that are aligned with organizational objectives and are understandable to both data scientists and non-technical stakeholders, existing techniques for evaluating models fail on both counts.
SUMMARY
[0015] A universal, model-agnostic computer-implemented system dynamically evaluates and visualizes the performance of any model, such as a machine learning model, regardless of its type or implementation. The system receives test set data containing model score outputs and actual dependent variable values for a plurality of test cases. The system then computes performance metric values based on this data using a specified performance metric with adjustable parameters. A key innovative aspect is the system’s ability to display and update visual representations of these performance metrics in real-time, allowing users to interactively explore and analyze model performance. Users can adjust parameters through an intuitive interface, and the system immediately updates the performance metric calculations and visual outputs to reflect these changes.
[0016] This dynamic, interactive approach enables users to evaluate models using both technical and business metrics, compare multiple models side-by-side, optimize the setting of the decision threshold in accordance with multiple, competing metrics, optimize the modeling process for specific use cases, and assess model performance across different scenarios and thresholds.
[0017] The system's model-agnostic design allows it to work with any type of model, from neural networks to decision trees to hand-written models, providing a consistent evaluation framework across an organization's diverse modeling projects. This universality, combined with its extensible
architecture that allows users to define custom performance metrics and visualizations, makes the system highly scalable across various modeling use cases and industries.
[0018] By providing intuitive, real-time visual feedback on model performance, the system enables both technical and non-technical stakeholders to gain deeper insights into model behavior, leading to more informed decisions about model development and deployment. This approach to model evaluation addresses a critical need in the machine learning industry, potentially improving the success rate of model deployments and the overall value derived from machine learning projects.
[0019] Other features and advantages of various aspects and embodiments of the present invention will become apparent from the following description and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] FIGS. 1A-1B are dataflow diagrams of a system for evaluating the performance of a first model according to one embodiment of the present invention.
[0021] FIG. 2 is a flowchart of a method performed by the system of FIG. 1 according to one embodiment of the present invention.
[0022] FIGS. 3A-3D are illustrations of graphical user interfaces for displaying model performance metrics and enabling user manipulation of parameters that affect model performance metric values according to various embodiments of the present invention.
DETAILED DESCRIPTION
[0023] Embodiments of the present invention address challenges in evaluating and deploying models, such as machine learning models, across various industries and use cases. As organizations increasingly rely on predictive models to improve their operations, there is a growing need for comprehensive, flexible, interactive, and user-friendly tools to assess both technical model performance and the business value of models. Embodiments of the present invention address these needs.
[0024] In particular, embodiments of the present invention provide a computer-implemented method for evaluating the performance of models. The method includes: identifying a model and specifying a performance metric with adjustable parameters; receiving test set data that includes model scores and actual outcomes for a set of test cases; computing performance metric values based on the test set data and the specified metric; displaying visual output representing these performance metric values; and allowing a user to adjust metric parameters and immediately see updated results. Note that any reference herein to a “score” or “model score” without a qualifier (such as “machine learning”) refers to a predictive score, which itself refers to any score that is output by a model of any kind.
[0025] One insight on which embodiments of the present invention are based is that the calculation of performance metrics, including both technical metrics and business metrics, is based on a model's scores, not on the internal workings of a model. Metrics reflect how well a model works - not how a model works. Although model training relates directly to the model's internal mechanics, model evaluation does not. Embodiments of the present invention take advantage of this insight to evaluate the performance of models without requiring access to or knowledge of how those models work.
[0026] In addition to the step of evaluating a model before its deployment, business metrics are also vital to the step of evaluating a model after its deployment, in order to monitor its performance and reassess the way in which it is deployed. Moreover, evaluating models in terms of business metrics is also vital in order to evaluate an updated or refreshed model that has been trained over more recent data to serve as a candidate replacement model.
[0027] In addition to assessing whether and how to deploy a model, evaluating a model in terms of business metrics is also vital for the development and improvement of the model in the first place. Model development may be a semi-automatic process that involves a repeated train-and-test cycle. The training step may be automatic, using a machine learning method to generate a model. But then the model is evaluated by a human data scientist so that he or she can consider potential alterations to its training process, including expanding or otherwise improving the training data, the choice of machine learning method, and the parameter settings that control the machine learning method. The evaluation step of this cycle is typically conducted only in terms of technical metrics. This severely limits the potential business value of the model, since only business metrics can inform the navigation of model development toward business value. You cannot know whether you’re headed in the right direction unless you measure your progress toward the desired destination. Since the aim is business value, business metrics must be calculated.
[0028] To conclude, if you are not measuring business value, you are not pursuing business value. To develop models, such as machine learning models, capable of driving valuable improvements to operations, it is critical to evaluate models in terms of business value — that is, in terms of business metrics. Doing so is also critical for communicating the business value to non-data science experts such as business decision makers and stakeholders. Any business professional in a position to decide whether and how to deploy a model to improve operations can only do so in an informed manner by evaluating the model in terms of business metrics.
[0029] In contrast to calculating technical metrics, which abstract a model’s predictive performance away from the manner in which the model’s outputted predictive scores may be utilized, calculating business metrics must involve particulars of the business context in which the model’s predictive scores will be utilized, and also must involve the specific way in which the scores will be utilized. This is because a business metric estimates the business value of a model in its intended usage.
Therefore, a business metric’s value often depends in part on parameters that are subject to change or uncertainty. For example, to calculate the savings - a business metric - of a fraud detection predictive model, the false positive cost and false negative cost must be established. Even after these parameters are estimated, they are subject to change and/or uncertainty. A change to them will alter the shape of a savings curve chart, which plots the savings against the confidence threshold. The shape of that curve informs various pragmatic tradeoffs that can only be decided on by human decision makers and stakeholders. Since that shape may be altered by the value of these and other parameters, it is critical to provide a means to visualize the effects of altering the parameters on the shape of that curve.
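The following sketch illustrates that dependence (it is one plausible formulation under stated assumptions, not the specification's definition: savings are measured against a baseline of taking no action, so every actual positive incurs the false negative cost):

```python
import numpy as np

def savings_curve(sorted_actuals, fp_cost, fn_cost):
    """Savings versus a no-action baseline at every candidate threshold.

    `sorted_actuals` holds the dependent variable for all test cases, sorted
    by model score in descending order; cutting the list at position k treats
    the top k cases as "positive". Changing fp_cost or fn_cost reshapes the
    entire curve, which is why visualizing their effect matters.
    """
    actuals = np.asarray(sorted_actuals, dtype=float)
    tp = np.concatenate(([0.0], np.cumsum(actuals)))       # positives caught
    fp = np.concatenate(([0.0], np.cumsum(1 - actuals)))   # negatives flagged
    fn = actuals.sum() - tp                                # positives missed
    baseline = actuals.sum() * fn_cost                     # cost with no model
    return baseline - (fp * fp_cost + fn * fn_cost)

print(savings_curve([1, 0, 1, 0, 0], fp_cost=5.0, fn_cost=100.0))
# -> [  0. 100.  95. 195. 190. 185.]
```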
[0030] This approach offers several key advantages over current evaluation practices. For example, the method can work with any type of model, from neural networks to decision trees to hand-written models, making it universally applicable across different predictive modeling approaches. By incorporating both technical and business metrics, embodiments of the present invention bridge the gap between model performance and organizational objectives. Users can dynamically adjust parameters, such as decision thresholds, and immediately visualize the impact on model performance, facilitating better decision-making about deployment strategies. The visual output and interactive elements make complex model evaluation concepts more understandable to non-technical stakeholders, addressing a common communication barrier in machine learning projects. The method’s design allows for the incorporation of custom metrics and visualizations, enabling it to adapt to diverse use cases and organizational needs.
[0031] By providing a comprehensive toolkit for model evaluation that is model-agnostic, embodiments of the present invention aim to improve the success rate of deploying machine learning models and other models. Such embodiments empower organizations to make more informed decisions about model suitability, optimize deployment strategies, facilitate decision-maker deployment authorization by communicating a model’s value in terms of business metrics, and ultimately realize greater value from their investment in machine learning models and other models.
[0032] Referring to FIGS. 1A-1B, dataflow diagrams are shown of a system 100 for evaluating the performance of a first model according to one embodiment of the present invention. Referring to FIG. 2, a flowchart is shown of a method 200 performed by the system 100 according to one embodiment of the present invention.
[0033] The system 100 includes a first machine learning model 102a and a machine learning model identification module 104, which identifies the first machine learning model 102a (FIG. 2, operation 202). As shown in FIG. 1A, the first machine learning model 102a may be one of a plurality of machine learning models 102a-n in the system 100, where n may be any number. Even more generally, although the term “machine learning model” is used for convenience to refer to the models 102a-n, any of the models 102a-n may be any kind of model, such as a machine learning model or a hand-written model. In general, any reference herein to a “model” without a qualifier (such as “machine learning” or “hand-written”) refers to any kind of model. As this implies, even though the module 104 is referred to herein as a “machine learning” model identification module 104, the module 104 may identify any kind of model, whether that model is a machine learning model or a hand-written model.
[0034] The first machine learning model 102a may, for example, be any type of machine learning model, such as any of the following: a neural network (such as a deep learning model), a decision tree, an ensemble model (such as a random forest), a large language model (LLM), a logistic regression model, a statistical regression model, a naive Bayes classifier, a support vector machine (SVM), a K-nearest neighbors (KNN) model, or a gradient boosting machine. The first machine learning model 102a may be a trained model or a hand-written model.
[0035] The plurality of machine learning models 102a-n may include multiple machine learning models of the same type, multiple machine learning models of different types, multiple hand-written models of the same type, multiple hand-written models of different types, at least one machine learning model and at least one hand-written model, or any combination thereof. For example, any two of the plurality of machine learning models 102a-n may be of the same type or of a different type.
[0036] Although the machine learning models 102a-n (including the first machine learning model 102a) and the machine learning model identification module 104 are shown in FIGS. 1A-1B for ease of illustration and explanation, it should be understood that the system 100, and embodiments of the present invention more generally, do not need to include any models. Furthermore, embodiments of the present invention do not need to develop (e.g., train or hand-write) any models or use any models (whether machine-learning or hand-written), such as to generate any of the scores disclosed herein. Instead, embodiments of the present invention may perform the functions disclosed herein based on scores that were output by one or more models, even if such embodiments (e.g., the system 100) do not include such models and do not use such models to generate such scores. For example, the system 100, and embodiments of the present invention more generally, may perform any of the functions disclosed herein in connection with model scores, even if such scores were generated by another system outside of embodiments of the present invention, regardless of the method that was used to generate such scores. Although, as the preceding discussion implies, and as disclosed elsewhere herein, embodiments of the present invention may include one or more models (such as the machine learning models 102a-n), and although embodiments of the present invention may use such models to generate model scores in order to facilitate certain kinds of system integrations, this is optional and not required by all embodiments of the present invention.
[0037] The machine learning model identification module 104 may identify any of a variety of aspects of the first machine learning model 102a, such as any one or more of the following:
• Unique Identifier (ID): This may, for example, be any data that uniquely identifies the first machine learning model 102a among the plurality of machine learning models 102a-n.
• Model type: This may specify whether the first machine learning model 102a is a neural network, decision tree, ensemble model, hand-written model, or any other particular type of model.
• Model source (e.g., trained or hand-written).
• Model metadata: This may include information such as the first machine learning model 102a’s version, development (e.g., training) date, an enumeration of required input variables, or other relevant information.
[0038] The machine learning model identification module 104 may identify the first machine learning model 102a in any of a variety of ways. For example, a user 126 of the system 100 may provide input selecting the first machine learning model 102a, in response to which the machine learning model identification module 104 may identify the first machine learning model 102a based on the user input. As another example, the machine learning model identification module 104 may identify the first machine learning model 102a automatically, such as based on predefined criteria. In a programmatic implementation, the machine learning model identification module 104 may identify the first machine learning model 102a using an API call or other programmatic action, for example when a model-training system invokes an embodiment of this invention in order to allow a user to conduct a more comprehensive evaluation of a model that the model-training system has trained.
[0039] Although operation 202 is described herein in connection with identifying the first machine learning model 102a, it should be understood that the machine learning model identification module 104 may identify a second one of the plurality of machine learning models 102a-n, and any additional number of the plurality of machine learning models 102a-n in addition to the first machine learning model 102a, either at the same time as operation 202 or at one or more subsequent times, using any of the techniques disclosed herein for selecting the first machine learning model 102a.
[0040] The system 100 also includes a plurality of performance metric specifications 106a-m, where m may be any number. Each of the plurality of performance metric specifications 106a-m specifies (e.g., represents) a corresponding performance metric. The system 100 also includes a performance metric specification identification module 108, which identifies a first performance metric specification 106a in the plurality of performance metric specifications 106a-m (FIG. 2, operation 204). The first performance metric specification 106a has a first parameter 130a having a first value 132a. The first parameter 130a may, for example, be a decision threshold.
[0041] In general, a “performance metric,” as that term is used herein, is a quantitative measure used to evaluate the effectiveness, correctness, and/or value of a model's predictions. A performance metric serves as a standardized way to assess how well a model performs its intended task, typically accomplished in part by comparing the model's predictions to known actual outcomes. Performance metrics may take a variety of forms, such as mathematical formulas, comparative measures, aggregation functions, threshold-based calculations, financial forecasts, and cost or value functions. Performance metrics may be broadly categorized into technical metrics (which report on the predictive performance of a model irrespective of the model’s intended usage) and business metrics (which relate to organizational goals by reporting on the value attained by using the model in terms such as profit and loss). The term “metric,” when used herein without a qualifier (such as “technical” or “business”) is used synonymously with “performance metric.”
[0042] The first performance metric specification 106a may be a specification of any type of technical performance metric, such as any of the following: accuracy (the proportion of correct predictions made by the model); area under the receiver operating characteristic curve (AUC); precision (the proportion of true positive predictions among all positive predictions); recall, also known as sensitivity or true positive rate (the proportion of true positive predictions among all actual positive cases); F-score (a metric that combines precision and recall); specificity or true negative rate (the proportion of true negative predictions among all actual negative cases); lift (a measure of how much better the model performs compared to random guessing); gains (a cumulative measure of the model's predictive power); false positive rate (the proportion of false positive predictions among all actual negative cases); false negative rate (the proportion of false negative predictions among all actual positive cases).
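For concreteness, the following sketch (an illustration by the editor, not part of the specification) computes several of the threshold-dependent technical metrics listed above from confusion-matrix counts:

```python
def technical_metrics(tp, fp, tn, fn):
    """Threshold-dependent technical metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # also called sensitivity or true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Hypothetical counts at one decision threshold:
print(technical_metrics(tp=120, fp=380, tn=9320, fn=180))
```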
[0043] The first performance metric specification 106a may be a specification of any type of business performance metric, such as any of the following: profit (the financial gain resulting from model deployment); revenue gain (the increase in revenue attributed to the model's predictions); savings (cost reductions achieved through model implementation); costs (various expenses associated with model deployment and operation); misclassification costs (the financial impact of false positives and false negatives); customer attrition (the rate at which customers are lost, which the model aims to reduce); loss ratio (particularly relevant for insurance companies, measuring the ratio of losses to premiums); marketing response rate (the effectiveness of targeted marketing campaigns based on model predictions); return on investment (ROI) (the overall financial return relative to the cost of implementing and using the model); debtor default rate (the rate at which debtors fail to repay loans, which the model may aim to predict and reduce); labor reduction (efficiency gains in terms of reduced labor requirements); customer retention rate (the rate at which customers continue to use a product or service); customer acquisition (the number of new customers gained through model-driven strategies); amount of fraud prevented (the financial or quantitative measure of fraud detection and prevention); lives saved (in healthcare or safety applications, the number of lives potentially saved through model predictions).
[0044] The plurality of performance metric specifications 106a-m may include specifications of multiple performance metrics of the same type, specifications of multiple performance metrics of different types, or any combination thereof. For example, any two of the plurality of performance metric specifications 106a-m may be specifications of performance metrics of the same type or of a different type.
[0045] The first performance metric specification 106a may specify a first performance metric that has a first parameter 130a having a first value 132a. The first value 132a affects the value of the first performance metric, whose calculation is based in part on the first value 132a. In practice, the first performance metric specification 106a may include a data structure having the first parameter 130a, which may have the first value 132a. Because the first performance metric specification 106a specifies or otherwise represents the first performance metric, it should be understood that any reference herein to “the first performance metric specification” may refer to the first performance metric and vice versa.
[0046] The first parameter 130a may be any of a variety of parameters, such as any of the following (an illustrative sketch of such a parameterized specification appears after this list):
• Decision threshold: This is explicitly mentioned in the claim and is a crucial parameter for many metrics. It determines the score above which a model's prediction is considered "positive" and therefore determines the cases for which a certain business operation would or would not be conducted. For example, with a 90% confidence threshold, only scores of 90% or higher would be considered positive predictions, and certain business operations would or would not be conducted for cases with such scores. The decision threshold could, for example, determine which transactions would be screened for fraud, which customers would be contacted for marketing, and which satellites would be investigated for batteries predicted to soon run out of energy.
• False positive cost: This parameter represents the cost, loss, or penalty associated with a false positive prediction. It's particularly relevant for business metrics like profit or cost calculations.
• False negative cost: Similar to false positive cost, this parameter represents the cost or penalty associated with a false negative prediction.
• True positive gain: This parameter represents the benefit or value associated with a correct positive prediction.
• True negative gain: This parameter represents the benefit or value associated with a correct negative prediction.
• Population size: This parameter can be used to scale metrics for forecasting performance over an arbitrarily large number of expected cases, rather than only calculating for the particular number of cases in the test set.
• Time window: For time-sensitive metrics, this parameter could specify the time frame over which the metric is calculated.
• Segmentation criteria: This parameter could be used to calculate metrics for specific subsets of the data based on certain characteristics.
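By way of illustration only (class and field names here are the author's assumptions), such a parameterized performance metric specification might be represented as a simple data structure pairing a metric with its adjustable parameter values:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PerformanceMetricSpecification:
    """Illustrative container for a performance metric with parameters."""
    name: str
    # Adjustable parameter values keyed by parameter name; for example, the
    # first value 132a of a decision-threshold parameter 130a might live
    # under "decision_threshold".
    parameters: Dict[str, float] = field(default_factory=dict)

profit_spec = PerformanceMetricSpecification(
    name="profit",
    parameters={
        "decision_threshold": 0.90,     # initial/default value
        "true_positive_gain": 40.0,
        "false_positive_cost": 2.0,
        "population_size": 1_000_000,   # for scaling beyond the test set
    },
)
```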
[0047] The first performance metric may have any number of parameters, such as one or more parameters in addition to the first parameter 130a. Multiple parameters of the first performance metric 106a may differ from each other in any way. For example, parameters of the first performance metric 106a may include any two or more of the parameters in the list above.
[0048] As mentioned above, the first parameter 130a of the first performance metric 106a may have a first value 132a. The first value 132a may, for example, be an initial setting of the first parameter 130a, serving as a starting point for the evaluation process described herein. This allows the method 200 to begin with an initial default setting for comparison purposes. The first value 132a of the first parameter 130a may be user-defined (e.g., by the user 126), allowing for customization based on specific use cases or organizational needs. Alternatively, the first value 132a of the first parameter 130a may, for example, be a default value predetermined by the system 100.
[0049] The first parameter 130a may have a possible range of values, and the first value 132a of the first parameter 130a may be within that possible range of values. For example, if the first parameter 130a is a decision threshold, it might range from 0 to 100 percent, and the first value 132a of the first parameter 130a may be between 0 and 100 (inclusive). An embodiment of the invention may allow the user to specify the allowable range of values.
[0050] As will be described in more detail below, the value of the first parameter 130a may be variable. As this implies, at an earlier or subsequent time, the first parameter 130a may have a value other than the first value 132a.
[0051] Anything said herein about the first performance metric 106a is equally applicable to any performance metric. Anything said herein about the first parameter 130a of the first performance metric 106a is equally applicable to any parameter of the first performance metric 106a, and to any parameter of any performance metric. Anything said herein about the first value 132a of the first parameter 130a of the first performance metric 106a is equally applicable to any value of any parameter of any performance metric.
[0052] The system 100 also includes test set data 110 associated with a test set. The system 100 includes a performance metric computation module 114, which receives the test set data 110 (FIG. 2, operation 206). The purpose of using a test set is to evaluate the first machine learning model 102a’s performance on data it has not seen during training, providing an objective measure of its predictive capabilities.
[0053] The test set data 110 includes a plurality of test case data elements 112 corresponding to a plurality of test cases in the test set. The plurality of test case data elements 112 include a plurality of score outputs and a plurality of dependent variable values. Each test case data element E in the plurality of test case data elements 112 corresponds to a corresponding test case C. Each test case data element E includes: (1) data representing a score output by the first machine learning model 102a when test case C is provided as an input to the first machine learning model 102a; and (2) data representing a value of a dependent variable for test case C. This structure allows for establishing the relationship, if any, between the predictions of the first machine learning model 102a and the known outcomes, which enables the system 100 and method 200 to calculate performance metrics.
[0054] As described elsewhere herein, the system 100 and method 200 can work with any type of model, as they only require the first machine learning model 102a’s output scores (in the plurality of test case data elements 112) and do not need access to the first machine learning model 102a's internal workings.
[0055] Furthermore, the test set data 110 may, but need not, be identical in structure to the test data that was used to train and/or test the first machine learning model 102a or any of the machine learning models 102a-n. As a result, the test set data 110 need not be “test data” as that term is used conventionally in the field of machine learning. For example, the test set data 110 may include the same set of variables as the training data that was used to train one or more of the machine learning models 102a-n (such as the first machine learning model 102a), and also include one or more variables that were not part of the original training or test sets used to develop such model(s). As a particular example, the test set data 110 may include both: (1) some or all of the variables of the training data that was used to train the first machine learning model 102a, such as in the form of a table; and (2) an additional variable containing, for each of the plurality of test cases represented by the test case data elements 112, the score produced by the first machine learning model 102a for that test case. It could also include (3) an additional variable not included within the training data but provided for use in the calculation of a metric, such as the magnitude of a transaction that may be fraudulent, since that magnitude affects the cost associated with a model failing to correctly identify transactions that are fraudulent. To assist in evaluating multiple models simultaneously, additional score columns may be included for each model. This allows for the incorporation of additional information, in the test set data 110, that may be relevant to the evaluation process.
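Purely as an illustrative assumption of how such test set data might be laid out (using pandas; every column name below is hypothetical):

```python
import pandas as pd

test_set_data = pd.DataFrame({
    # (1) independent variables carried over from the training data
    "transaction_amount": [125.00, 9.99, 4300.00],
    "merchant_category": ["grocery", "digital", "travel"],
    # (2) one score column per model being evaluated
    "model_a_score": [0.12, 0.55, 0.97],
    "model_b_score": [0.08, 0.61, 0.93],
    # the known outcome (dependent variable, or "actual")
    "is_fraud": [False, False, True],
    # (3) an extra variable used only in metric calculation, such as the
    # magnitude of a possibly fraudulent transaction
    "transaction_magnitude": [125.00, 9.99, 4300.00],
})
```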
[0056] The performance metric computation module 114 may compute, based on each of the plurality of test case data elements 112, a corresponding first performance metric value that represents performance of the first machine learning model 102a relative to the plurality of dependent variable values, using the first performance metric, thereby computing a plurality of first performance metric values 116a corresponding to the plurality of test case data elements 112 (FIG. 2, operation 208). This step leverages the structure of the test set data 110 received in step 206, which includes both the first machine learning model 102a’s score outputs and the actual dependent variable values for each test case (in the plurality of test case data elements 112). The purpose of this computation is to generate a set of performance measurements that represent how well the first machine learning model 102a performs across all test cases. By calculating a metric value for each test case, the method 200 provides a detailed view of the first machine learning model 102a’s performance, allowing for both aggregate analysis and examination of performance variations across different subsets of the data.
[0057] The computation method used by the performance metric computation module 114 to compute the plurality of first performance metric values 116a may vary depending on the specific performance metric being used. For example, for a metric such as accuracy, the computation may involve comparing the first machine learning model 102a’s prediction (based on its score output and the decision threshold) to the dependent variable value for each test case. For a metric such as profit, the calculation may incorporate the true positive profit and false positive loss values, applying them based on whether the first machine learning model 102a’s prediction (based on its score output and the decision threshold) matches the actual outcome.
[0058] A plurality of metric values are computed, one for each test case, because each test case corresponds with one possible setting for the decision threshold, and each setting for the decision threshold affects the calculation of a metric. When a decision threshold is set at 0%, it corresponds with 0 test cases, which means that all test cases are treated as positive cases. When a decision threshold is set at 1/p, where p equals the number of test cases, it means that all test cases but the first test case are treated as positive cases. This continues for the entire range of possible threshold settings, corresponding with the entire plurality of test cases. This range of metric values corresponding with the range of possible threshold settings can be depicted in a chart where the horizontal axis represents the threshold setting and the vertical axis represents the metric value.
[0059] To more efficiently compute the plurality of metric values, one per test case, the test cases may first be sorted by their scores before computing the metric values. For example, after sorting the test scores, the profit metric may be calculated across the list of test cases, keeping a running total of false positives and true positives so that the performance metric computation module 114 need only run through the test set one time. A slightly more complex method for calculating the cost performance metrics, based on a running total of false positives and false negatives, need only run through the test set two times. Either way, the computational complexity of the method used by the performance metric computation module 114 is only O(p), where p is the number of test cases.
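A minimal sketch of that sorted, running-total pass follows (the function name and signature are illustrative assumptions; the sort itself costs O(p log p), after which the pass over the test cases is O(p)):

```python
def profit_per_threshold(scores, actuals, tp_gain, fp_loss):
    """One profit value per candidate threshold (one per test case).

    Walks the test cases from highest to lowest score, keeping running
    totals of true and false positives, so the whole curve is produced
    in a single pass after sorting.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    profits, tp, fp = [], 0, 0
    for i in order:   # thresholds from strictest to loosest
        if actuals[i]:
            tp += 1
        else:
            fp += 1
        profits.append(tp * tp_gain - fp * fp_loss)
    return profits

print(profit_per_threshold([0.9, 0.2, 0.7], [True, False, True], 40.0, 2.0))
# -> [40.0, 80.0, 78.0]
```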
[0060] The system 100 includes a visual output module 118, which displays visual output 120a representing the plurality of first performance metric values 116a (FIG. 2, operation 210). The visual output 120a makes the performance of the first machine learning model 102a easily interpretable and actionable for users, such as the user 126.
[0061] The visual output 120a may take any of a variety of forms, such as any one or more of the following in any combination:
• Graphs and Charts:
o Profit Curves: These plot cumulative profit against the percentage of cases targeted or the confidence threshold. The horizontal axis represents the confidence threshold from 0% to 100%, while the vertical axis shows the profit. This allows users to visualize how profit changes as the threshold is adjusted.
o Cost Curves: Similar to profit curves, these display the aggregate cost across different confidence thresholds. They are particularly useful for use cases like fraud detection or risk assessment.
o ROC (Receiver Operating Characteristic) Curves: These plot the true positive rate against the false positive rate at various threshold settings, providing a visual representation of the trade-off between sensitivity and specificity.
o Precision-Recall Curves: These show the trade-off between precision and recall as the decision threshold is varied.
o Lift Charts: These display how much better the model performs compared to random selection across different confidence thresholds. Specifically, a lift chart displays the proportion of positive cases correctly identified by the model for each decision threshold setting, and shows for comparison the average random model, represented as a straight diagonal line from the bottom-left to the top-right.
• Interactive Dashboards:
o These combine multiple visual elements, such as graphs, metrics, and controls, into a single interface. Users can interact with various components to explore different aspects of the model's performance.
o Dashboards may include sliders or other on-screen controls for adjusting parameters like true positive profit, false positive loss, and population size. As users adjust these parameters, the affected visual elements update in real-time to reflect the changes.
o They may also include geographical selection tools, allowing users to evaluate model performance for specific regions or subsets of data.
• Heat Maps or Color-Coded Tables:
o These can represent performance across different subsets of data or different threshold values. For example, a heat map could show how the model performs across different customer segments or product categories.
o Color gradients can be used to quickly identify areas of strong or weak performance, making it easy to spot patterns or anomalies in the model's behavior.
• Histograms or Distribution Plots:
o These can show the distribution of performance metric values across a range of decision threshold settings, providing insights into the overall spread and central tendency of the model's performance.
o Overlaying multiple histograms can allow for easy comparison between different models or different subsets of the data.
• Comparative Visualizations:
o These allow for side-by-side comparison of multiple models. For example, a single graph might show profit curves for three different models, each in a different color, allowing users to easily compare their performance.
o A table can more succinctly show a plurality of metrics for ease of visual comparison. The metrics in a table all update in real-time as the user alters any parameter settings.
• Explanatory Visualizations:
o These include features like an "Explain" button that generates both visual and auditory explanations of the displayed metrics and graphs.
o They may use animations or step-by-step breakdowns to guide users through the interpretation of complex visualizations, making them accessible to non-technical stakeholders.
o A chatbot, such as one utilizing a large language model, can be included to allow the user to pose questions about the meaning of the charts and all visual elements, and to expound on the pragmatic tradeoffs between models and between parameter settings reflected by the visual elements. Such a chatbot allows interrogation by the user to guide decision-making as to which model to deploy and how to deploy it.
• Dynamic, Real-Time Visualizations: o These update smoothly and quickly in response to user interactions, such as adjusting decision threshold values or other parameters. o They may employ efficient calculation methods and data structures to enable fast updates even for large test sets.
[0062] For example, referring to FIG. 3A, an example graphical user interface (GUI) component is shown for visualizing and interactively adjusting model performance metrics. The GUI of FIG. 3A includes a profit curve graph that displays the relationship between the confidence threshold and the profit generated by the model. The GUI also includes two slider controls: a true positive profit slider and a false positive loss slider. These sliders allow the user to dynamically adjust the values for the true positive profit parameter and false positive loss parameter, respectively. As the user moves these sliders, the system recalculates and displays the altered profit curve in real-time using the efficient binning and calculation methods disclosed herein. This typically results in the overall shape of the curve dynamically changing in response to the user's manipulation of a slider.
[0063] A vertical dotted line is displayed on the profit curve graph, representing the current confidence threshold setting. The user can click and drag this line to interactively adjust the confidence threshold. As the line is moved, the system updates the displayed profit value and other relevant metrics based on the new threshold position. This interactive visualization allows users to explore how different parameter settings affect the model's performance and profitability. By combining the efficient calculation methods disclosed herein with the binning approach disclosed herein, the system can provide smooth, real-time updates to the profit curve as the user adjusts the sliders or the confidence threshold line, even when working with large datasets.
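The profit-curve recalculation described above can be sketched in code. The following TypeScript fragment is illustrative only: the names (profitCurve, ScoredCase, truePositiveProfit, falsePositiveLoss) are assumptions, not names used by the disclosed system, and the brute-force loop stands in for the more efficient binning method described later.

```typescript
// Illustrative sketch only: recomputing a profit curve from scored test
// cases when the slider-controlled parameters change. Names are assumed.
interface ScoredCase {
  score: number;   // model score, e.g., a probability in [0, 1]
  actual: boolean; // true if the test case's outcome was positive
}

function profitCurve(
  cases: ScoredCase[],
  truePositiveProfit: number, // value of the true positive profit slider
  falsePositiveLoss: number,  // value of the false positive loss slider
  steps: number = 100         // number of threshold positions to plot
): { threshold: number; profit: number }[] {
  const points: { threshold: number; profit: number }[] = [];
  for (let i = 0; i <= steps; i++) {
    const threshold = i / steps;
    let profit = 0;
    for (const c of cases) {
      // Cases at or above the threshold are treated as positive predictions.
      if (c.score >= threshold) {
        profit += c.actual ? truePositiveProfit : -falsePositiveLoss;
      }
    }
    points.push({ threshold, profit });
  }
  return points;
}
```

This naive version rescans every test case for every threshold position; the sorting and binning techniques described below avoid that cost, which is what makes real-time slider response feasible for large test sets.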
[0064] Referring to FIG. 3B, a GUI is shown for visualizing and comparing the performance of multiple models. The GUI includes a profit curve graph that displays the relationship between the confidence threshold and the estimated profit generated by three different models. The GUI includes two slider controls similar to those in FIG. 3A: a true positive profit slider and a false positive loss slider. These sliders allow the user to dynamically adjust the values for the true positive profit parameter and false positive loss parameter, respectively. As the user moves these sliders, the system recalculates and displays the altered profit curves for all three models in real-time using the efficient binning and calculation methods described earlier. This typically results in the overall shapes of the curves dynamically changing in response to the user's manipulation of a slider.
[0065] Referring to FIG. 3C, a GUI is shown for visualizing the performance of a single model using a gains curve. The GUI includes a gains curve graph that displays the relationship between the confidence threshold (represented on the horizontal axis) and the proportion of positive cases identified (represented on the vertical axis). The gains curve provides a visual representation of the model's ability to identify positive cases at different confidence threshold levels. As the confidence threshold increases (moving from right to left on the horizontal axis), the proportion of positive cases identified by the model increases, and yet the relative performance decreases, if compared to the baseline of random guessing, shown as a straight line from the bottom-left to the top-right of the chart.
[0066] Referring to FIG. 3D, a GUI is shown for visualizing and comparing the performance of multiple models using a savings curve. The interface 330 includes a savings curve graph that displays the relationship between the percentage of cases treated (represented on the horizontal axis) and the amount of savings generated (represented on the vertical axis). The horizontal axis extends from 0% on the left to 11.58% on the right; this “zoomed in” focus allows for a more detailed view of the relative performance of the models within that range of percent-treated options. An embodiment of the invention could allow the user to zoom or unzoom in this way by way of interface interactions. The savings curve provides a visual representation of the model’s ability to generate savings at different treatment threshold levels. As the percentage of treated cases increases (moving from left to right on the horizontal axis), the amount of savings changes, potentially reaching a peak before declining.
[0067] The GUI shown in FIG. 3D includes multiple performance curves as examples, each representing a different model: Decision Tree 1, Decision Tree 2, and Logistic. Although these particular performance curves are merely examples, FIG. 3D more generally illustrates the ability of a GUI to assist the user in visually comparing the performance of different model types in terms of the savings they generate.
[0068] A vertical dotted line is shown at approximately 2.14% on the horizontal axis, which may be moved by the user to set a specific treatment threshold. As this line is moved, the system 100 may update the displayed savings amount at that threshold. Such updating may leverage the efficient binning and calculation methods described earlier to provide real-time updates.
[0069] The GUI shown in FIG. 3D includes two parameter controls at the bottom: “Cost of wrongly blocking” and “Cost of undetected fraud.” These parameters represent the false positive and false negative costs respectively, which are used to calculate the overall savings in a fraud detection scenario in which each case corresponds with a transaction that may or may not be fraudulent. The user may use the sliders associated with these parameters, in response to which the system may update the savings curve in FIG. 3D using any of the techniques disclosed herein to enable the user to see how such adjustments in parameter values affect the savings curves in real-time.
[0070] The system 100 (FIG. 1B) may include a parameter adjustment module 122, which may receive first parameter adjustment input 124 from the user 126 (FIG. 2, operation 212). As will be described in more detail below, such input 124 enables the user 126 to interactively explore and optimize the performance of the first machine learning model 102a. The system 100 and method 200 may receive the first parameter adjustment input 124 from the user 126 using any of a variety of user interface elements (e.g., graphical user interface (GUI) elements), such as any one or more of the following, in any combination:
• Sliders, dials, or levers: The user 126 may move a slider, dial, or lever to adjust the values of parameters such as decision thresholds or cost values.
• Text input fields: The user 126 may directly enter specific values for parameters.
• Dropdown menus: The user 126 may select values from predefined options for categorical parameters.
• Interactive graphical elements: The user 126 may adjust parameter values by interacting with visual representations, such as moving a line on a graph.
[0071] The first parameter adjustment input 124 may specify an adjustment to any of a variety of parameters. For example, if the first parameter 130a is a decision threshold, then the first parameter adjustment input 124 may specify an adjustment to the confidence threshold that determines when the first machine learning model 102a’s prediction is considered positive. For example, changing the threshold from 90% to 80% would classify more predictions as positive. Such an adjustment may assist in finding an optimal balance between, for example, true positives and false positives.
[0072] As another example, if the first parameter 130a is a false positive cost, then the first parameter adjustment input 124 may specify a modification to the cost associated with false positive predictions. Such a modification may assist in reflecting a real-world change in the business impact of false positives in the performance metric.
[0073] As another example, if the first parameter 130a is a true positive profit, then the first parameter adjustment input 124 may specify a modification to the value or benefit associated with true positive predictions. Such a modification may assist in more accurately representing, or updating when needed, the gain from correct positive predictions in the performance metric.
[0074] By allowing users to adjust these parameters, the system 100 enables interactive exploration of the first machine learning model 102a’s performance under different scenarios. This interactivity supports better decision-making about model deployment and optimization, as users can immediately see the effects of their adjustments on the first machine learning model 102a’s performance metrics.
[0075] The parameter adjustment module 122 may update the first parameter 130a of the first performance metric 106a to a second value 132b of the first parameter 130a of the first performance metric 106a, based on the first parameter adjustment input 124 (FIG. 2, operation 214). The second value 132b of the first parameter of the first performance metric may differ from the first value 132a of the first parameter 130a of the first performance metric 106a.
[0076] Such updating may be performed in any of a variety of ways. For example, the parameter adjustment module 122 may directly assign the value specified by the first parameter adjustment input 124 as the second value 132b of the first parameter 130a. As another example, the parameter adjustment module 122 may perform a calculation based on the value specified by the first parameter
adjustment input 124, and assign the result of the calculation as the second value 132b of the first parameter 130a.
[0077] The performance metric computation module 114 may compute, based on each of the plurality of test case data elements 112, a corresponding updated first performance metric value using the second value 132b of the first parameter 130a of the first performance metric 106a, thereby computing a plurality of updated first performance metric values 116b corresponding to the plurality of test case data elements 112 (FIG. 2, operation 216). In other words, the method 200 may recompute the values of the first performance metric 106a using the updated (second) parameter value 132b that was set in step 214. This step enables the system 100 and method 200 to provide feedback (e.g., in real-time) on how the parameter adjustment reflected in the first parameter adjustment input 124 affects the first machine learning model 102a’s performance evaluation. Note that operation 216 may use the same test set data 110 that was used in the initial computation performed by the performance metric computation module 114 (step 208), but may apply the new (second) parameter value 132b to recalculate the values of the first performance metric 106a.
[0078] The purpose of this recomputation is to show how the first machine learning model 102a's performance changes in response to the parameter adjustment that results from the first parameter adjustment input 124. By calculating the updated metric values 116b for each test case, the method 200 provides a comprehensive view of how the parameter change affects performance across all of the test cases.
[0079] The method of recomputation will depend on the specific performance metric being used and how it incorporates the adjusted parameter value. For example, if the first parameter 130a is a decision threshold, the computation might involve re-evaluating each model prediction against this new threshold and recalculating the metric accordingly. For a metric like profit, the calculation might use the new values for parameters such as true positive profit or false positive loss to recompute the profit for each test case.
[0080] Regardless of how the recomputation is performed, the result of operation 216 is a new set of updated performance metric values 116b that correspond one-to-one with the test cases represented by the plurality of test case data elements 112. This allows for a direct comparison between the original 116a and updated 116b metric values, enabling users to see the immediate impact of their parameter adjustment on the first machine learning model 102a’s evaluated performance. By recomputing the metric values based on the second value 132b, this step enables dynamic, interactive exploration of the first machine learning model 102a’s performance under different scenarios. This supports better decision-making about model deployment and optimization, as users can immediately see how their parameter adjustments affect the first machine learning model 102a’s performance metrics.
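As a hedged illustration of operation 216, the sketch below recomputes one metric value per test case using an updated parameter value while reusing the same test set. All names are hypothetical, not the disclosed system's API.

```typescript
// Illustrative sketch of recomputing per-case metric values (operation 216)
// after a parameter update. All names are assumptions for illustration.
type ScoredCase = { score: number; actual: boolean };

function recomputePerCaseProfit(
  cases: ScoredCase[],        // the same test set used in the initial pass
  threshold: number,          // current decision threshold
  truePositiveProfit: number, // updated (second) parameter value
  falsePositiveLoss: number
): number[] {
  // One updated metric value per test case, in one-to-one correspondence
  // with the test case data elements.
  return cases.map(c =>
    c.score >= threshold
      ? (c.actual ? truePositiveProfit : -falsePositiveLoss)
      : 0
  );
}
```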
[0081] The visual output module 118 may display visual output 120b representing the plurality of updated first performance metric values 116b (FIG. 2, operation 218). As this implies, the updated visual output 120b may differ in any of a variety of ways from the previous visual output 120a. This step may include providing immediate, visual feedback on how the change in the value of the first parameter 130a from the first value 132a to the second value 132b affects the first machine learning model 102a’s performance evaluation. The visual output can take various forms, similar to those described in step 210, but now reflecting the updated performance metric values 116b.
[0082] The method 200 may perform operations 216 and 218 in real-time, e.g., with little or no delay that is perceptible to the user 126 in response to providing the first parameter adjustment input 124. As a result, the user 126 can instantly see the impact of their parameter adjustments on the first machine learning model 102a's performance, enabling rapid iteration and exploration of different scenarios. Such visual representations, especially when updated repeatedly in real-time in response to parameter adjustment input, make it easier for users, especially non-technical stakeholders, to grasp complex relationships between parameters and performance metrics. Such real-time updates allow users to quickly understand the impact of changes to parameter settings by observing how changes affect the visual representations of performance. Users can dynamically explore the model's behavior under different conditions, leading to more comprehensive analysis and better-informed decisions. Fast, responsive updates create a seamless interaction, encouraging users to explore more thoroughly and gain deeper insights.
[0083] As described herein, certain embodiments of the present invention may perform various actions in “real-time.” As an example of what is meant by “real-time,” as the user 126 adjusts the value of the first parameter 130a, such as by moving a slider for the true positive profit or false positive loss, the system 100 may immediately generate and display the corresponding visual output 120b. The system 100 may provide immediate or near-immediate visual feedback as the user 126 interacts with graphical elements. For instance, when the user 126 drags a slider representing the true positive profit or false positive loss on a graph, the system 100 may update the visual output representing the corresponding metrics continuously during the drag operation, thereby causing the entire shape of the profit curve to change in real-time.
[0084] The system 100 may generate and display the visual output 120b within no more than a particular amount of time (e.g., 16ms or 32ms) after the system 100 receives the first parameter adjustment input 124. This ensures a smooth, responsive feel to user interactions. As another example, for dynamic graph updates, such as reshaping a profit curve, the system 100 may complete the redraw (e.g., the generation of the visual output 120b) within 32 milliseconds of receiving the parameter adjustment input 124. More generally, the system 100 may complete all visual updates, including
complex recalculations and redraws, within 0.05 seconds of receiving the parameter adjustment input 124, thereby ensuring that users perceive the updates as occurring in real-time.
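One plausible way to meet such a frame budget in a web-based implementation is to coalesce rapid slider events and redraw at most once per animation frame. The sketch below is a generic browser-side technique, not the mechanism specified by this disclosure; redrawProfitCurve is a placeholder name.

```typescript
// Generic browser-side throttling sketch: redraw at most once per frame
// (~16 ms), always using the most recent slider value. Illustrative only.
function redrawProfitCurve(parameterValue: number): void {
  // Placeholder for the actual chart-update routine.
}

let redrawPending = false;
let latestParameterValue = 0;

function onSliderInput(value: number): void {
  latestParameterValue = value;
  if (!redrawPending) {
    redrawPending = true;
    requestAnimationFrame(() => {
      redrawPending = false;
      redrawProfitCurve(latestParameterValue);
    });
  }
}
```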
[0085] Embodiments of the present invention are particularly well-suited for evaluating models that drive binary decisions, such as whether to audit a transaction, contact a customer, or approve a loan application. As this implies, any one or more of the plurality of machine learning models 102a-n may be binary models and/or models that drive binary decisions. For example, in some embodiments, all of the plurality of machine learning models 102a-n are binary models.
[0086] A binary model is a type of model that predicts outcomes that fall into two distinct categories or classes. These models output a predictive score that reflects the expected chance that the outcome will be one of two possible outcomes. The pair of possible outcomes would typically be represented as a pair such as "yes" or "no", "1" or "0", or "true" or "false". For example, a binary model’s predictive score could be a probability, reflecting the chances for each case that the outcome will be “true” (rather than “false”). Binary models are commonly used in applications such as fraud detection (fraudulent vs. legitimate transactions), customer churn prediction (will a customer leave or stay), medical diagnosis (presence or absence of a condition), spam detection (spam vs. non-spam emails), and credit approval (approve or deny a loan application). The output of a binary model is often a probability or score representing the likelihood of an instance belonging to one of the two classes. A decision threshold is then applied to this score to make the final binary classification.
[0087] Although binary models directly align with binary decisions, both binary and non-binary models may also be used to drive binary decisions, such as by applying thresholds to their outputs. For example, a model predicting customer spending amounts can be used to make a binary decision by setting a threshold: contact customers predicted to spend above a certain amount.
[0088] The first performance metric specified by the first performance metric specification 106a may include a second parameter (not shown). More generally, the first performance metric may include any number of parameters. Any of these parameters (e.g., the second parameter) may, for example, be any of the following: a confidence threshold, a parameter of the confidence threshold, a false positive cost, a parameter of the false positive cost, a false negative cost, a parameter of the false negative cost, a true positive gain, and a parameter of the true positive gain.
[0089] A parameter of a parameter refers to a variable that modifies or influences another parameter used in calculating a performance metric. Parameters of parameters allow for more nuanced and flexible evaluation of models by introducing additional layers of customization and adjustment. For example, in the case of fraud detection, instead of (or in addition to) using a fixed false negative (FN) cost, embodiments of the present invention allow for a more dynamic approach, in which the primary parameter may be the transaction amount, which varies for each case in the test set, and a secondary parameter (a parameter of the transaction amount parameter) may be a “transaction size fraud loss factor” that determines what percentage of the transaction amount is actually lost due to fraud. Such a parameter of a parameter enables users to adjust not only the base parameter, but also how that parameter is calculated or applied, allowing for more precise and nuanced model evaluation.
[0090] The description above refers to the test set data 110, which may exist before any of the methods disclosed herein are performed, and which may have been generated in any way. Some embodiments of the present invention may, however, generate the test set data 110. Generating the test set data 110 may include, for each of the plurality of test case data elements 112 (in the test set data 110) corresponding to a corresponding test case C: (1) providing the test case C as an input to the first machine learning model 102a; and (2) using the first machine learning model 102a to output a score based on the test case C. The same is equally applicable to any of the plurality of machine learning models 102a-n.
[0091] The test set data 110 may take any of a variety of forms. It may be helpful, however, for the plurality of test case data elements 112 to be sorted by their score outputs within the test set data 110. To achieve this, the method 200 may include (e.g., in operation 208): (1) sorting the plurality of test case data elements 112 by their score outputs to produce a sorted plurality of test case data elements; and (2) for each test case data element E in the sorted plurality of test case data elements, corresponding to a corresponding test case C, in sorted order: computing the first performance metric value corresponding to test case data element E based on the value of the dependent variable for test case C. Sorting the plurality of test case data elements by their score outputs may include sorting the plurality of test case data elements by their score outputs in ascending order or in descending order.
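A minimal sketch of this sort-then-scan pattern, under assumed names, appears below; it computes a cumulative gains value for each position in the sorted order in a single pass.

```typescript
// Illustrative sketch: sort test cases by score (descending) and compute a
// cumulative metric, here the gains curve (fraction of all positives
// captured at each cut point), in one pass over the sorted order.
type ScoredCase = { score: number; actual: boolean };

function gainsCurve(cases: ScoredCase[]): number[] {
  const sorted = [...cases].sort((a, b) => b.score - a.score);
  const totalPositives = sorted.filter(c => c.actual).length;
  const gains: number[] = [];
  let captured = 0;
  for (const c of sorted) {
    if (c.actual) captured++;
    gains.push(totalPositives === 0 ? 0 : captured / totalPositives);
  }
  return gains;
}
```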
[0092] Such sorting has a variety of benefits. For example, sorting the test case data elements 112 by their score outputs allows for more efficient computation of performance metrics. This is particularly useful when calculating metrics that depend on the ranking of predictions, such as lift or cumulative gains. Furthermore, by having the plurality of test case data elements 112 sorted, it becomes easier to analyze the effect of different decision thresholds on model performance. This is valuable for optimizing the first machine learning model 102a's operational settings and understanding trade-offs between different performance metrics (e.g., false positive rate vs. profit). Sorted data elements also enable the creation of more informative visualizations, such as lift curves or cumulative gain charts, which are valuable for evaluating model performance and communicating results to stakeholders. Furthermore, computing performance metrics for each test case data element in sorted order allows for efficient incremental calculations. This is particularly beneficial for metrics that accumulate values across the dataset, such as cumulative profit or cost.
[0093] When evaluating multiple models simultaneously, having sorted test case data elements for each model simplifies the process of comparing their performance. Sorted data elements also enable
rapid recalculation of performance metrics when users interactively adjust decision thresholds, enhancing the system 100’s responsiveness and user experience.
[0094] Computing the first performance metric value corresponding to test case data element E based on the value of the dependent variable for test case C may include: processing test case data elements preceding test case data element E in the sorted plurality of test case data elements using a first operation; and processing test case data elements following (and optionally including) test case data element E in the sorted plurality of test case data elements using a second operation that differs from the first operation. By using different operations for the preceding and following elements, the system 100 can optimize the computation process, potentially reducing the overall processing time. For example, using different operations for the preceding and following elements allows for the efficient calculation of cumulative metrics, such as aggregate cost, by processing the data in a sorted order and applying different operations based on the position of each element.
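For example, an aggregate-cost metric fits this two-operation pattern: elements at or above the cut point are "treated" (so negatives among them incur the false positive cost), while elements below it are "untreated" (so positives among them incur the false negative cost). The sketch below, with assumed names, evaluates every cut point in linear time by carrying running counts.

```typescript
// Illustrative two-operation scan: one aggregate cost per possible cut
// point. Names and the specific cost formula are assumptions.
type ScoredCase = { score: number; actual: boolean };

function aggregateCostAtEachCut(
  sortedDescending: ScoredCase[], // pre-sorted by score, highest first
  falsePositiveCost: number,
  falseNegativeCost: number
): number[] {
  const totalPositives = sortedDescending.filter(c => c.actual).length;
  const costs: number[] = [];
  let falsePositivesTreated = 0; // negatives among treated (preceding) cases
  let truePositivesTreated = 0;  // positives among treated cases
  for (let i = 0; i <= sortedDescending.length; i++) {
    const falseNegativesUntreated = totalPositives - truePositivesTreated;
    costs.push(
      falsePositivesTreated * falsePositiveCost +
      falseNegativesUntreated * falseNegativeCost
    );
    if (i < sortedDescending.length) {
      if (sortedDescending[i].actual) truePositivesTreated++;
      else falsePositivesTreated++;
    }
  }
  return costs;
}
```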
[0095] The computed first performance metric values may be accumulated as they are being computed. By accumulating the metric values during computation, the system 100 can maintain a running total without needing to store individual values for each test case data element. This is particularly beneficial when dealing with large datasets, as it reduces the memory footprint of the evaluation process. Accumulating values as they are computed also allows for real-time updates of the performance metric, which is especially useful for interactive visualizations or when adjusting parameters dynamically. Accumulation also facilitates incremental metric calculations, which is particularly useful for metrics that depend on cumulative values, such as lift or cumulative gain. As the system 100 processes each test case data element in sorted order, it can efficiently update the accumulated metric value without needing to reprocess previously computed elements.
[0096] As mentioned above, each of the plurality of test case data elements 112 may further include a corresponding value of an additional independent variable. In this case, the method 200 may: filter the test set data to only include test case data elements corresponding to test cases that satisfy a condition related to the additional independent variable; and only the plurality of first performance metric values corresponding to the filtered test set data may be computed. Such filtering can be useful for a variety of purposes. For example, filtering test cases based on demographic variables like age, gender, location, or income level allows evaluation of model performance across different population segments to identify any disparities or biases. As another example, filtering based on customer attributes such as purchase history, loyalty status, or customer lifetime value enables assessment of model effectiveness for different customer segments, which is crucial for targeted marketing or personalized services. For financial models, filtering based on transaction amount, type, or frequency is especially useful for fraud detection models to evaluate performance across different transaction profiles. Filtering test cases by time periods (e.g., weekdays vs. weekends, seasons, or
specific date ranges) helps to assess model performance under different temporal conditions. Filtering based on specific product categories or service types assists in evaluating model performance across different business lines or offerings. These and other types of filtering enable targeted analysis, support fairness assessments, and provide insights into model performance across various business-relevant segments, enhancing the overall utility and flexibility of the model evaluation process.
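A sketch of such filtering, with a hypothetical region field standing in for the additional independent variable, might look like this:

```typescript
// Illustrative filtering step: restrict the test set to cases satisfying a
// condition on an additional independent variable before computing metrics.
type TestCase = { score: number; actual: boolean; region: string };

function filterByRegion(cases: TestCase[], region: string): TestCase[] {
  return cases.filter(c => c.region === region);
}

// Metrics would then be computed over filterByRegion(allCases, "EU") only.
```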
[0097] The first performance metric specification 106a may further specify one or more additional variables, and each of the plurality of test case data elements 112 may further include a corresponding value of the additional variable(s). Computing the plurality of first performance metric values 116a may include: computing, for each test case data element E in the plurality of test case data elements 112, the corresponding first performance metric value based on the specification of the first performance metric 106a and test case data element E's score output, dependent variable value, and additional variable value(s).
[0098] Using such an additional variable offers several advantages. For example, by incorporating additional variables into the performance metric calculation, the system 100 can provide a more comprehensive and nuanced evaluation of model performance. This is particularly useful for scenarios where the impact of a prediction varies based on factors beyond just the dependent variable. Furthermore, the use of an additional variable enables more accurate representation of real-world scenarios because, in many applications, the impact of a model's prediction can vary significantly based on additional factors. For example, in fraud detection, the cost of a false negative (failing to detect fraud) is directly related to the transaction amount: a $5,000 fraudulent transaction has a much higher impact than a $500 fraudulent transaction. By incorporating such variables, the performance metric can more accurately reflect the real-world consequences of model decisions.
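A hedged sketch of a per-case cost that depends on such an additional variable (the transaction amount) is shown below; the loss factor corresponds to the parameter-of-a-parameter idea from paragraph [0089], and all names are illustrative.

```typescript
// Illustrative per-case cost using an additional variable (amount).
type FraudCase = { score: number; actual: boolean; amount: number };

function perCaseCost(
  c: FraudCase,
  threshold: number,
  falsePositiveCost: number, // cost of wrongly blocking a legitimate case
  fraudLossFactor: number    // fraction of the amount lost to missed fraud
): number {
  const flagged = c.score >= threshold;
  if (flagged && !c.actual) return falsePositiveCost;          // false positive
  if (!flagged && c.actual) return c.amount * fraudLossFactor; // false negative
  return 0; // true positives and true negatives incur no cost here
}
```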
[0099] The techniques disclosed herein may be applied to scores output by one or more additional models (in addition to the scores output by the first machine learning model 102a). For example, each test case data element E of the plurality of test case data elements 112 may further include data representing a score output by a second machine learning model when test case C is provided as an input to the second machine learning model. Operation 208 may further include computing a plurality of second performance metric values corresponding to the second model score outputs. The visual output 120a may include visual output representing the second model score outputs, and displaying the visual output 120a in operation 210 may include displaying the visual output representing the plurality of second performance metric values.
[0100] By including score outputs from multiple models, the system 100 may enable direct comparison of different models' performance on the same test set. This facilitates more comprehensive model evaluation and selection. Displaying visual output for multiple models
simultaneously allows for intuitive comparison of their performance characteristics. This can be particularly useful for identifying strengths and weaknesses of different models across various segments of the data. Computing performance metrics for multiple models within the same process can be more efficient than evaluating each model separately, especially when dealing with large test sets. This approach also allows for easy comparison of different model versions, supporting iterative model development and optimization.
[0101] As mentioned above, the plurality of machine learning models 102a-n may include models of at least two different types (e.g., a neural network and a decision tree, as just one example). Each test case data element E may further include data representing a score output by the second machine learning model when test case C is provided as an input to the second machine learning model. In that case, the method 200 may include selecting a second model from the plurality of models, wherein the second model is distinct from the first model. The method 200 may also include performing operations 208 (computing the plurality of performance metric values) and 210 (displaying visual output) using the second model.
[0102] By applying the same evaluation process to multiple models, the system 100 enables users to directly compare the performance of different model types on the same test set, facilitating more informed model selection decisions. In addition, the ability to select and evaluate different models within the same framework supports a more efficient and comprehensive model development process. Furthermore, by evaluating multiple models simultaneously, the system can facilitate the development of ensemble models or other advanced techniques that combine predictions from different model types. Displaying visual output for multiple models allows for intuitive comparison of their performance characteristics.
[0103] The first performance metric specification 106a may define the first performance metric as a parameterized mathematical formula that includes model score and dependent variable value as parameters. The first performance metric specification 106a may define the first performance metric as a parameterized algebraic expression that includes model score and dependent variable value as parameters. Identifying the specification of the first performance metric may include: (1) receiving first performance metric input from the user 126; and (2) generating the specification of the first performance metric based on the first performance metric input. This approach allows users to define metrics that are tailored to their specific use cases and business needs, going beyond standard off-the-shelf metrics. The parameterized nature of the metric definition allows for easy adjustment and fine-tuning of the evaluation criteria. By incorporating both model scores and dependent variable values, the metric can provide a more holistic assessment of model performance.
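One way such a parameterized specification could be represented in software is as a function of the model score, the dependent variable value, and a dictionary of named parameters, so that built-in and user-supplied metrics share a uniform interface. This is a sketch under assumed names, not the disclosed implementation.

```typescript
// Illustrative uniform shape for a parameterized performance metric.
type MetricFn = (
  score: number,                 // model score for one test case
  actual: boolean,               // dependent variable value
  params: Record<string, number> // named, user-adjustable parameters
) => number;

// Example instance: a per-case profit metric.
const profitMetric: MetricFn = (score, actual, params) =>
  score >= params.threshold
    ? (actual ? params.truePositiveProfit : -params.falsePositiveLoss)
    : 0;
```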
[0104] The system 100 and method 200 may be applied to any number of additional performance metrics, in addition to the first performance metric specified by the first performance metric specification 106a. As
this implies, the method 200 may further include: identifying a specification of a second performance metric having a second parameter having a third value, wherein the second parameter comprises a decision threshold; computing, based on each of the plurality of test case data elements 112, a corresponding second performance metric value that represents performance of the first machine learning model 102a relative to the plurality of dependent variable values, using the second performance metric, thereby computing a plurality of second performance metric values corresponding to the plurality of test case data elements; and displaying visual output (within the visual output 120a) representing the plurality of second performance metric values.
[0105] By incorporating multiple performance metrics, the system 100 enables a more thorough assessment of model performance, capturing different aspects of model behavior and business relevance. Different metrics may be more appropriate for different use cases or business objectives. Moreover, organizations usually must strike a balance between competing goals that are reflected by competing metrics. For example, an organization could aim to maximize the savings delivered by deploying a fraud detection model, and yet simultaneously aim to minimize the number of false positives incurred by that deployment, which inconvenience legitimate (non-fraudulent) transactors. Those two goals and their metrics conflict, since increasing the former typically also increases the latter. To understand and evaluate how to strike a balance between these two competing metrics, it is helpful for the user to visualize their interaction and the tradeoff options. This approach also allows users to evaluate models using metrics that are most relevant to their specific needs. Presenting multiple metrics simultaneously enables stakeholders to make more informed decisions about model selection, deployment, and optimization.
[0106] The method 200 may further include fixing the first performance metric’s decision threshold to the second performance metric’s decision threshold, which may include: receiving user input specifying a change to either the first performance metric’s decision threshold or the second performance metric’s decision threshold; changing the selected decision threshold based on the user input received; and changing the other decision threshold to be equal to the selected decision threshold. For example, if the user input specifies to change the first performance metric’s decision threshold, then the method 200 may both change the first performance metric’s decision threshold based on the user input and change the second performance metric’s decision threshold to be equal to the changed first performance metric’s decision threshold.
[0107] By fixing the decision thresholds of multiple performance metrics (e.g., tying their values to each other), the system 100 can ensure that different metrics are evaluated at the same operating point (e.g., decision threshold setting), providing a more consistent and comparable assessment of model performance. This enables an “apples to apples” comparison between two metrics, since their values are established in light of the identical decision threshold value. When this feature is in use,
users only need to adjust one threshold, and the system 100 automatically aligns the other threshold, reducing the complexity of managing multiple metrics. Aligning decision thresholds in this way makes it easier to compare the performance of different metrics or models at the same operating point, facilitating more informed decision-making.
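The tying behavior itself is simple to sketch: input naming either metric's threshold updates that threshold and then copies the value to the other, so both metrics are always evaluated at the identical operating point. Names below are illustrative, not the system's API.

```typescript
// Illustrative threshold tying: a change to either threshold propagates.
const thresholds: { first: number; second: number } = { first: 0.5, second: 0.5 };

function onThresholdInput(which: "first" | "second", value: number): void {
  thresholds[which] = value;
  const other = which === "first" ? "second" : "first";
  thresholds[other] = value; // keep both metrics at the same operating point
}
```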
[0108] Computing the plurality of first performance metric values in operation 208 may include: associating the test set data 110 with a plurality of bins, wherein each of the plurality of bins is associated with a distinct subset of the plurality of test case data elements 112; and, for each of the plurality of bins: (1) computing a corresponding plurality of intermediate values of the first performance metric for that bin; and (2) computing the corresponding first performance metric value for that bin based on the bin's plurality of intermediate values of the first performance metric.
[0109] More generally, operation 208 may include binning the test set data 110 and using intermediate values to efficiently calculate performance metrics. The system 100 may associate the test set data 110 with a plurality of bins, where each bin contains a distinct subset of test case data elements. This reduces the granularity of the data, allowing for more efficient processing.
[0110] For each bin, a corresponding plurality of intermediate values of the first performance metric may be computed. These intermediate values may be designed to contain all the necessary information to calculate the performance metric value for that bin given a change to any parameter, without needing to process individual test cases within the bin.
[0111] The corresponding first performance metric value for each bin may be computed based on the bin's plurality of intermediate values and all parameter values. This significantly reduces computational time, especially when dealing with large datasets or when recalculating metrics due to parameter changes.
[0112] The binning process decreases the resolution of the decision threshold settings. For example, with 10,000 test cases binned into 1,000 bins of 10 each, the threshold can only be placed in 1,001 positions instead of 10,001. This trade-off between precision and computational efficiency is generally acceptable in practice, since it enables real-time response to parameter changes, and yet it can be switched off if needed for a more granular decision threshold setting.
[0113] When parameters such as false positive (FP) or false negative (FN) costs are changed, the entire plurality of metric values can be quickly recalculated using only the intermediate values and the updated parameters. This allows for real-time updates and interactive exploration of model performance under different scenarios.
[0114] The binning techniques described above may be applied to different types of performance metrics. For example, in a fraud detection model, the intermediate values could be the number of false positives and false negatives within each bin, allowing for quick calculation of an aggregate cost metric. This approach enhances the overall efficiency and responsiveness of the model
evaluation process, particularly when dealing with large datasets or when providing interactive visualizations that require real-time updates based on user input.
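The following sketch illustrates the binning idea under assumed names: a one-time pass tallies per-bin positive and negative counts (serving as the intermediate values), after which the entire cost curve can be recomputed from those counts alone whenever the FP or FN cost changes.

```typescript
// Illustrative binning sketch. Intermediate values here are per-bin counts
// of positive and negative cases; all names are assumptions.
type ScoredCase = { score: number; actual: boolean };
type BinCounts = { positives: number; negatives: number };

// One-time pass: sort by score (descending) and tally counts per bin.
function buildBins(cases: ScoredCase[], binCount: number): BinCounts[] {
  const sorted = [...cases].sort((a, b) => b.score - a.score);
  const bins: BinCounts[] = Array.from({ length: binCount }, () => ({
    positives: 0,
    negatives: 0,
  }));
  sorted.forEach((c, i) => {
    const b = Math.min(binCount - 1, Math.floor((i * binCount) / sorted.length));
    if (c.actual) bins[b].positives++;
    else bins[b].negatives++;
  });
  return bins;
}

// Cheap recomputation on every parameter change: one cost per bin boundary,
// derived from the intermediate values alone (O(bins), not O(test cases)).
function costCurve(bins: BinCounts[], fpCost: number, fnCost: number): number[] {
  const totalPositives = bins.reduce((sum, b) => sum + b.positives, 0);
  const costs: number[] = [];
  let fpTreated = 0;
  let tpTreated = 0;
  for (let k = 0; k <= bins.length; k++) {
    costs.push(fpTreated * fpCost + (totalPositives - tpTreated) * fnCost);
    if (k < bins.length) {
      fpTreated += bins[k].negatives;
      tpTreated += bins[k].positives;
    }
  }
  return costs;
}
```

With 1,000 bins this yields the 1,001 threshold positions noted above, and each slider movement costs only a 1,000-element scan rather than a fresh pass over millions of test cases.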
[0115] Recall that the method 200 may include computing, based on each of the plurality of test case data elements, a corresponding updated first performance metric value using the second value of the first parameter of the first performance metric, thereby computing a plurality of updated first performance metric values corresponding to the plurality of test case data elements. Such computing of the plurality of updated first performance metric values may include, for each of the plurality of bins, computing the corresponding updated first performance metric value for that bin based on the bin's plurality of intermediate values of the first performance metric, thereby forgoing the need for a lengthy processing of the complete test case data to achieve this computation.
[0116] By utilizing the pre-computed intermediate values for each bin, the system 100 can quickly recalculate the performance metric values when parameters are changed, without needing to process individual test cases again. This efficient recalculation enables real-time updates of performance metrics in response to user input, facilitating interactive exploration of model performance under different scenarios. This method allows for efficient handling of large datasets by reducing the computational requirements of updating performance metrics, making it feasible to work with extensive test sets. Users can easily adjust parameters and immediately see the impact on performance metrics, and how those metrics vary across a plurality of decision threshold values, supporting more comprehensive model evaluation and optimization. By relying on intermediate values rather than raw test case data, this approach can significantly increase speed, especially for large datasets.
[0117] The techniques disclosed herein, such as the binning techniques disclosed above, may be implemented in any of a variety of ways. For example, such techniques may be implemented using any of a variety of programming languages, such as JavaScript and/or TypeScript. Such techniques may be implemented using a software-as-a-service (SaaS) model, in which some computations are executed on a server and the results of those computations are provided over a network to a client, which performs functions such as displaying output based on the computation results. Embodiments, including SaaS embodiments, can also perform all or some computations on the client side, in order to improve responsiveness as visual output changes in response to user parameter changes. More generally, the techniques disclosed herein may be performed using server-side processing, client-side processing, or any combination thereof.
[0118] Embodiments of the present invention may permit users (e.g., the user 126) to create customized model evaluation capabilities, called widgets, which can be integrated into the system 100, re-used by the creating user, and optionally shared with other users. This user-definable widget feature within embodiments of the present invention addresses the “long tail” of model evaluation
capabilities; while embodiments of the present invention provide a core suite of model evaluation capabilities, the widget feature enables users to create additional, specialized evaluation tools that cater to specific use cases or industries not covered by the standard offerings.
[0119] By allowing users to define and implement their own widgets, embodiments of the present invention become a scalable, general-purpose model evaluation platform. This approach significantly expands the range of evaluation metrics and visualizations available to users, ensuring that the system 100 can adapt to diverse and evolving needs across various industries and machine learning applications.
[0120] The user-definable widget feature may also foster a collaborative ecosystem in which users can contribute their widgets to an expanding library, making them accessible to other users and organizations. This aspect of embodiments of the present invention draws parallels to app store models, such as the Salesforce AppExchange marketplace, in which users can both create and benefit from a growing collection of specialized tools.
[0121] Although the following example refers to the user 126, this is merely an example. To define a custom widget, the user 126 may provide input that can take various forms, such as code written in a programming or scripting language. The system allows flexibility in the implementation language, with options like JavaScript for web-based implementations, although any suitable programming language may be used.
[0122] One example of a custom widget that the system 100 may enable the user 126 to define is a custom performance metric that calculates model performance based on one or more user-specified parameters and variables. Once the user 126 has defined such a custom widget that specifies a custom performance metric, the system 100 may use the custom performance metric in any of the ways disclosed herein in connection with performance metrics, such as in any of the ways disclosed herein in connection with the first performance metric 106a.
[0123] Another example of a custom widget that the system 100 may enable the user 126 to define is a custom visual output type, such as a graph. Once the user 126 has defined such a custom widget that specifies a custom visual output type, the system 100 may use the custom visual output type in any of the ways disclosed herein in connection with the visual output 120a-b.
[0124] The system 100 may enable the user 126 to incorporate various parameters and variables into a custom widget, such as any one or more of the following, in any combination: one or more independent variables of the test cases 110, one or more dependent variables of the test cases 110, and a confidence threshold.
[0125] The system 100 may enable the user 126 to include one or more user interface elements, such as text fields, checkboxes, sliders, and dropdown lists, in the definition of a custom widget, and to associate those user interface elements with corresponding parameters and/or variables. For
example, the user 126 may include a slider or other user interface element in the definition of a custom widget, and associate that element with a confidence threshold, so that when the custom widget is executed, the user 126 (or another user) may interact with the element to change the value of the confidence threshold.
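A widget definition along these lines could plausibly be represented as a metric function bundled with declarations of the controls bound to its parameters. The schema below is an assumption for illustration, not the system's actual widget format.

```typescript
// Illustrative widget schema: a metric plus the UI controls bound to its
// parameters. All field names are assumptions.
interface WidgetDefinition {
  name: string;
  metric: (score: number, actual: boolean, params: Record<string, number>) => number;
  controls: {
    kind: "slider" | "textField" | "checkbox" | "dropdown";
    parameter: string; // the metric parameter this control adjusts
    min?: number;
    max?: number;
  }[];
}

const profitWidget: WidgetDefinition = {
  name: "Profit curve",
  metric: (score, actual, params) =>
    score >= params.threshold
      ? (actual ? params.truePositiveProfit : -params.falsePositiveLoss)
      : 0,
  controls: [
    { kind: "slider", parameter: "threshold", min: 0, max: 1 },
    { kind: "slider", parameter: "truePositiveProfit", min: 0, max: 100 },
    { kind: "slider", parameter: "falsePositiveLoss", min: 0, max: 100 },
  ],
};
```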
[0126] Regardless of the type of custom widget that has been defined, once the user 126 has defined a custom widget, the system 100 may store the definition of that custom widget. Subsequently, the user 126 who created the widget or other users may cause the system 100 to execute the widget to perform its defined function. This allows for reuse and sharing of custom evaluation capabilities across users and organizations.
[0127] For example, the user 126 may create a widget that calculates and displays a profit curve with interactive sliders for adjusting parameters like true positive profit and false positive loss. Another example is a custom widget that calculates the aggregate cost of a fraud detection model with an additional slider for varying the transaction magnitude threshold.
[0128] A user 126 may position multiple widgets on the screen simultaneously in order to provide the user visual access to a greater number of interactive capabilities and visual elements at the same time. This positioning is akin to a dashboard, but differs in that the word “dashboard” in industry typically refers to retrospective counts across various subsegments (“slicing and dicing”), whereas a widget created within an embodiment of this system serves an entirely different purpose: evaluating the performance of a predictive model.
[0129] As an example, an embodiment of the system 100 or method 200 may be used by a user as follows. The following is intended to be an overview that covers some valuable, but not all, potential uses of embodiments of the present invention, such as the system 100 and method 200.
[0130] During the development of one or more predictive models using an external model-training system, project personnel desire access to comprehensive model evaluation capabilities. The deployment purpose of the models could be, for example, to block or audit transactions or media content that is fraudulent, abusive, conveying misinformation, or otherwise undesirable.
[0131] The data scientist provides, to the system 100, the test set data 110 that encodes the performance of the model(s) on test cases. Alternatively, an external model-training system may provide this data to the system 100 via an API.
[0132] The data scientist may also provide performance metric specifications 106a-m to the system 100. This may include, for example, metrics such as savings, false positive rate, and a gains curve. For each performance metric specification, the data scientist may provide the associated parameters, such as false positive cost and false negative cost.
[0133] The data scientist and other users may then use the system 100 to display and view the visual output 120a. This provides a view of the models' performance in various ways, according to multiple
organizational objectives, including increasing business metrics like savings and decreasing the false positive rate.
[0134] A business metric, such as savings, is based on business factors subject to change and uncertainty. The data scientist accounts for these factors by representing them as parameters within the performance metric specifications 106a-m. The system 100 allows each parameter to be altered by the user through a graphical user interface element such as a slider.
[0135] The above steps taken by the data scientist serve to configure the system 100 for a given model development project. Once configured, the interface is user-friendly, understandable, and relevant for any user, including non-data scientist business professionals, stakeholders, and decision makers.
[0136] Users, including non-data scientists, may use the system 100 to assess the potential value of the model(s) in any of a variety of ways, such as one or more of the following:
• Examining charts representing multiple values for one metric, such as savings. This reveals the maximum potential savings and the decision threshold setting to achieve that savings in model deployment.
• Changing parameters to see how altering business factors affects the metrics and curve shapes. This may include, for example, adjusting the decision threshold and other factors, such as false positive cost and false negative cost.
• Exploring ranges of possible parameter values to understand the model's viability under different conditions.
• Analyzing tradeoffs between metrics that reflect competing business objectives, allowing project personnel to make informed decisions about model deployment.
[0137] The model assessment steps described above and elsewhere herein may be taken at various phases of the model development process, such as any one or more of the following:
• Early proof-of-concept “prototype” modeling to determine a rough estimate of a modeling project's value.
• Throughout the entire model-development process leading up to model deployment, at each iteration.
• Before the first deployment of the model, to decide which model to deploy and how to deploy it, including determining the value at which to set the decision threshold.
• After model deployment while monitoring the model’s performance, to assess its performance over more recent test data and consider modifications to its deployment, such as changing the value at which the decision threshold is set.
• When refreshing the model with an updated model trained over more recent training data, comparing the existing deployed model (the “champion”) with the new model (the “challenger”).
[0138] The use case described above demonstrates that embodiments of the system 100 provide a flexible, comprehensive tool for model evaluation throughout the entire lifecycle of a machine learning project, enabling data scientists and business stakeholders to make informed decisions based on both technical and business-oriented performance metrics.
[0139] As described elsewhere herein, certain embodiments of the present invention include efficient data processing techniques that can process large datasets, such as to enable real-time computations and visual updates for datasets that would be impractical or impossible to process manually or with conventional methods. For example, embodiments of the present invention may apply any of the techniques disclosed herein to datasets containing large numbers of individual test cases, such as datasets including at least 100,000 test cases, at least 1 million test cases, at least 10 million test cases, at least 100 million test cases, or at least 1 billion test cases.
[0140] Embodiments of the present invention may be used in connection with datasets in which each of the test cases includes at least 10 independent variables, at least 100 independent variables, at least 1000 independent variables, at least 1 million independent variables, or at least 1 billion independent variables.
[0141] As some particular examples, embodiments of the present invention may be used in connection with datasets containing: at least 1 million test cases, each with at least 100 independent variables and at least one dependent variable; at least 10 million test cases, each with at least 50 independent variables and at least one dependent variable; at least 100 million test cases, each with at least 20 independent variables and at least one dependent variable.
[0142] Using the binning method described earlier, embodiments of the present invention may efficiently process such large datasets in real-time. For instance, with 1 million test cases binned into 1,000 bins, after calculating intermediate values for each bin, embodiments of the present invention may perform any one or more of the following computations within no more than 10 milliseconds: compute final metric values for all 1,000 bins based on the intermediate values; update the visual output (e.g., a profit curve or gains curve) with 1,000 data points. These computations may be performed and the visual output updated within 50 milliseconds of receiving a parameter adjustment input from the user, thereby providing a smooth, interactive user experience, even when the visual output is based on computations that involve the entirety of a dataset containing millions of test cases.
[0143] Furthermore, the real-time processing capabilities of embodiments of the present invention extend to even larger datasets. For example, when evaluating a model across 100 million test cases, embodiments of the present invention may still provide responsive visual updates by intelligently sampling or binning the data using the techniques disclosed herein. This allows for interactive
exploration of model performance on datasets that would be completely infeasible to analyze manually or with traditional methods.
[0144] It is to be understood that although the invention has been described above in terms of particular embodiments, the foregoing embodiments are provided as illustrative only, and do not limit or define the scope of the invention. Various other embodiments, including but not limited to the following, are also within the scope of the claims. For example, elements and components described herein may be further divided into additional components or joined together to form fewer components for performing the same functions.
[0145] Any of the functions disclosed herein may be implemented using means for performing those functions. Such means include, but are not limited to, any of the components disclosed herein, such as the computer-related components described below.
[0146] The techniques described above may be implemented, for example, in hardware, one or more computer programs tangibly stored on one or more computer-readable media, firmware, or any combination thereof. The techniques described above may be implemented in one or more computer programs executing on (or executable by) a programmable computer including any combination of any number of the following: a processor, a storage medium readable and/or writable by the processor (including, for example, volatile and non-volatile memory and/or storage elements), an input device, and an output device. Program code may be applied to input entered using the input device to perform the functions described and to generate output using the output device.
[0147] Embodiments of the present invention include features which are only possible and/or feasible to implement with the use of one or more computers, computer processors, and/or other elements of a computer system. Such features are either impossible or impractical to implement mentally and/or manually. For example, embodiments of the present invention provide a significant technical improvement to computer-based model evaluation systems by enabling more efficient and flexible analysis of model performance, particularly through an innovative binning approach and use of intermediate values. The binning approach allows the system to process and analyze large volumes of test data more efficiently. By grouping test cases into bins, the system reduces the computational complexity of performance metric calculations, enabling it to handle datasets that would be impractical to process individually. The use of intermediate values for each bin allows for quick recalculation of performance metrics when parameters are changed. This is particularly important for interactive model evaluation, where users may want to adjust parameters and immediately see the impact on model performance. The combination of binning and intermediate values enables real-time updates of performance metrics and visualizations. This allows users to explore model performance under different scenarios and parameter settings without experiencing significant delays, enhancing the overall user experience and facilitating more thorough model
evaluation.
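As a hedged illustration of the intermediate-value idea, and not a definitive implementation of the claimed method, the sketch below recomputes precision and recall from the hypothetical per-bin positive/negative counts introduced earlier, so that a changed decision threshold touches only the bins rather than the raw test cases. The function name and parameters are again assumptions for illustration.

```python
# Sketch: recompute threshold-dependent metrics from per-bin intermediate
# values (positive/negative counts) in O(n_bins), not O(n_test_cases).
import numpy as np

def metrics_at_threshold(pos: np.ndarray, neg: np.ndarray,
                         threshold: float, n_bins: int = 1000):
    """Return (precision, recall) at a given decision threshold."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    flagged = edges[:-1] >= threshold       # bins scored at/above threshold
    tp = pos[flagged].sum()                 # true positives among flagged bins
    fp = neg[flagged].sum()                 # false positives among flagged bins
    fn = pos[~flagged].sum()                # positives below the threshold
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```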
[0148] Embodiments of the present invention solve the technical problem of providing real-time, interactive model evaluation for large datasets and complex metrics, which is particularly relevant in the context of modern machine learning applications that often involve massive amounts of data and sophisticated performance criteria. For example, the invention employs a binning approach that allows for processing and analyzing large volumes of test data more efficiently. By grouping test cases into bins, the system reduces the computational complexity of performance metric calculations, enabling it to handle datasets that would be impractical to process individually. The use of intermediate values for each bin allows for rapid recalculation of performance metrics when parameters are changed. This enables real-time updates of performance metrics and visualizations, allowing users to explore model performance under different scenarios and parameter settings without experiencing significant delays. The intermediate value approach allows for the efficient calculation of complex performance metrics that may require multiple components or steps. This enables more sophisticated model evaluation without sacrificing computational efficiency, addressing the need for evaluating models using sophisticated performance criteria. This approach scales well with increasing dataset sizes. As the number of test cases grows, the binning method continues to provide efficient performance, allowing the system to maintain responsiveness even with very large datasets. This is valuable for modern machine learning applications that often deal with massive amounts of data.
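Continuing the same illustrative sketch (reusing the hypothetical bin_test_cases and metrics_at_threshold above), the fragment below mimics successive slider adjustments: each update costs time proportional to the number of bins, independent of the number of test cases, which is the scaling property this paragraph describes.

```python
# Illustrative slider callback: each position change recomputes metrics
# from the 1,000 per-bin counts (pos, neg from the earlier sketch), so
# update cost does not grow with the size of the underlying test set.
for threshold in (0.2, 0.5, 0.8):   # stand-ins for successive slider positions
    p, r = metrics_at_threshold(pos, neg, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```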
[0149] Embodiments of the present invention are necessarily rooted in computer technology. This machine-centric approach distinguishes embodiments of the present invention from mere mental processes or abstract ideas. For example, embodiments of the present invention are specifically designed to evaluate and analyze the performance of complex machine learning models, including neural networks, large language models, decision trees, and ensemble models. These models are inherently computational and cannot be meaningfully evaluated without computer technology. The innovative binning approach and use of intermediate values enable efficient handling of large datasets that would be impractical or impossible to process manually. The ability of embodiments of the present invention to provide real-time updates of performance metrics and visualizations in response to user input relies on computational speed and efficiency that can only be achieved through computer technology.
[0150] Any claims herein which affirmatively require a computer, a processor, a memory, or similar computer-related elements, are intended to require such elements, and should not be interpreted as if such elements are not present in or required by such claims. Such claims are not intended, and should not be interpreted, to cover methods and/or systems which lack the recited computer-related elements. For example, any method claim herein which recites that the claimed method is performed by a computer, a processor, a memory, and/or similar computer-related element, is intended to, and should only be interpreted to, encompass methods which are performed by the recited computer-related element(s). Such a method claim should not be interpreted, for example, to encompass a method that is performed mentally or by hand (e.g., using pencil and paper). Similarly, any product claim herein which recites that the claimed product includes a computer, a processor, a memory, and/or similar computer-related element, is intended to, and should only be interpreted to, encompass products which include the recited computer-related element(s). Such a product claim should not be interpreted, for example, to encompass a product that does not include the recited computer-related element(s).
[0151] Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language. The programming language may, for example, be a compiled or interpreted programming language.
[0152] Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor. Method steps of the invention may be performed by one or more computer processors executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, the processor receives (reads) instructions and data from a memory (such as a read-only memory and/or a random access memory) and writes (stores) instructions and data to the memory. Storage devices suitable for tangibly embodying computer program instructions and data include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays). A computer can generally also receive (read) programs and data from, and write (store) programs and data to, a non-transitory computer-readable storage medium such as an internal disk (not shown) or a removable disk. These elements will also be found in a conventional desktop or workstation computer as well as other computers suitable for executing computer programs implementing the methods described herein, which may be used in conjunction with any digital print engine or marking engine, display monitor, or other raster output device capable of producing color or gray scale pixels on paper, film, display screen, or other output medium.
[0153] Any data disclosed herein may be implemented, for example, in one or more data structures tangibly stored on a non-transitory computer-readable medium. Embodiments of the invention may store such data in such data structure(s) and read such data from such data structure(s).
[0154] Any step or act disclosed herein as being performed, or capable of being performed, by a computer or other machine, may be performed automatically by a computer or other machine, whether or not explicitly disclosed as such herein. A step or act that is performed automatically is performed solely by a computer or other machine, without human intervention. A step or act that is performed automatically may, for example, operate solely on inputs received from a computer or other machine, and not from a human. A step or act that is performed automatically may, for example, be initiated by a signal received from a computer or other machine, and not from a human. A step or act that is performed automatically may, for example, provide output to a computer or other machine, and not to a human.
[0155] The terms “A or B,” “at least one of A or/and B,” “at least one of A and B,” “at least one of A or B,” or “one or more of A or/and B” used in the various embodiments of the present disclosure include any and all combinations of words enumerated with it. For example, “A or B,” “at least one of A and B” or “at least one of A or B” may mean: (1) including at least one A, (2) including at least one B, (3) including either A or B, or (4) including both at least one A and at least one B.
[0156] Although terms such as “optimize” and “optimal” are used herein, in practice, embodiments of the present invention may include methods which produce outputs that are not optimal, or which are not known to be optimal, but which nevertheless are useful. For example, embodiments of the present invention may produce an output which approximates an optimal solution, within some degree of error. As a result, terms herein such as “optimize” and “optimal” should be understood to refer not only to processes which produce optimal outputs, but also processes which produce outputs that approximate an optimal solution, within some degree of error.
Claims
1. A computer-implemented method for evaluating the performance of a first model, the method comprising:
(A) identifying the first model;
(B) identifying a specification of a first performance metric having a first parameter having a first value, wherein the first parameter comprises a decision threshold;
(C) receiving test set data associated with a test set, the test set data comprising a plurality of test case data elements corresponding to a plurality of test cases in the test set, the plurality of test case data elements including a plurality of score outputs and a plurality of dependent variable values, each test case data element E in the plurality of test case data elements corresponding to a corresponding test case C, each test case data element E comprising: (1) data representing a score output by the first model when test case C is provided as an input to the first model; and (2) data representing a value of a dependent variable for test case C;
(D) computing, based on each of the plurality of test case data elements, a corresponding first performance metric value that represents performance of the first model relative to the plurality of dependent variable values, using the first performance metric, thereby computing a plurality of first performance metric values corresponding to the plurality of test case data elements;
(E) displaying visual output representing the plurality of first performance metric values;
(F) receiving, from a user, a first parameter adjustment input;
(G) updating the first parameter of the first performance metric to a second value of the first parameter of the first performance metric, based on the first parameter adjustment input, wherein the second value of the first parameter of the first performance metric differs from the first value of the first parameter of the first performance metric;
(H) computing, based on each of the plurality of test case data elements, a corresponding updated first performance metric value using the second value of the first parameter of the first performance metric, thereby computing a plurality of updated first performance metric values corresponding to the plurality of test case data elements; and
(I) displaying visual output representing the plurality of updated first performance metric values.
2. The method of claim 1, wherein the first model comprises a machine learning model.
3. The method of claim 1, wherein (F) comprises receiving the input from the user via a graphical user interface element.
4. The method of claim 3, wherein the graphical user interface element comprises a slider.
5. The method of claim 1, wherein a second parameter of the first performance metric is selected from the group consisting of a confidence threshold, a parameter of the confidence threshold, a false positive cost, a parameter of the false positive cost, a false negative cost, a parameter of the false negative cost, a true positive gain, and a parameter of the true positive gain.
6. The method of claim 1, wherein (C) further comprises generating the test set data, wherein generating the test set data comprises, for each test case data element E, corresponding to a corresponding test case C, in the plurality of test case data elements: providing the test case C as an input to the first model; and using the first model to output a score based on the test case C.
7. The method of claim 1, wherein the first model comprises a first machine learning model that is of a type selected from the group consisting of a neural network, a large language model, a decision tree, an ensemble model, and a logistic regression model.
8. The method of claim 1, wherein the first performance metric comprises a first technical metric.
9. The method of claim 8, wherein the first technical metric is of a type selected from the group consisting of f-score, precision, recall, sensitivity, specificity, lift, gains, false positive rate, false negative
rate, true positive rate, true negative rate, false positive count, false negative count, true positive count, true negative count, accuracy, and log loss.
10. The method of claim 1, wherein the first performance metric comprises a first business metric.
11. The method of claim 10, wherein the first business metric is of a type selected from the group consisting of profit, revenue gain, revenue, savings, costs, loss, misclassification costs, customer acquisitions, customer attrition, employee acquisitions, employee attrition, labor reduction, loss ratio, marketing response rate, return on investment, debtor default rate, customer retention rate, amount of fraud prevented, amount of crime prevented, lives saved, deaths, and customer attrition rate.
12. The method of claim 1, wherein (D) comprises:
(D)(1) sorting the plurality of test case data elements by their score outputs to produce a sorted plurality of test case data elements;
(D)(2) for each test case data element E in the sorted plurality of test case data elements, corresponding to a corresponding test case C, in sorted order: computing the first performance metric value corresponding to test case data element E based on the value of the dependent variable for test case C.
13. The method of claim 12, wherein computing the first performance metric value corresponding to test case data element E based on the value of the dependent variable for test case C comprises: processing test case data elements preceding test case data element E in the sorted plurality of test case data elements using a first operation; and processing test case data elements following and including test case data element E in the sorted plurality of test case data elements using a second operation that differs from the first operation.
14. The method of claim 12, wherein (D) further comprises accumulating the computed first performance metric values as they are being computed.
15. The method of claim 12, wherein (D)(1) comprises sorting the plurality of test case data elements by their score outputs in ascending order.
16. The method of claim 12, wherein (D)(1) comprises sorting the plurality of test case data elements by their score outputs in descending order.
17. The method of claim 1, wherein each of the plurality of test case data elements further includes a corresponding value of an additional independent variable, and wherein the method further comprises: filtering the test set data to only include test case data elements corresponding to test cases that satisfy a condition related to the additional independent variable; and wherein (D) comprises computing the plurality of first performance metric values corresponding only to the filtered test set data.
18. The method of claim 1, wherein the specification of the first performance metric further specifies an additional variable, wherein each of the plurality of test case data elements further includes a corresponding value of the additional variable, and wherein (D) comprises: computing, for each test case data element E in the plurality of test case data elements, the corresponding first performance metric value based on the specification of the first performance metric and test case data element E's score output, dependent variable value, and additional variable value.
19. The method of claim 1, wherein each test case data element E further comprises: (3) data representing a score output by a second model when test case C is provided as an input to the second model; wherein (D) further comprises computing a plurality of second performance metric values corresponding to the scores output by the second model; and wherein (E) further comprises displaying visual output representing the scores output by the second model.
20. The method of claim 1, wherein (A) comprises selecting the first model from a plurality of models.
21. The method of claim 20, wherein the plurality of models includes models of at least two types selected from the group consisting of neural network, large language model, decision tree, ensemble model, and logistic regression model.
22. The method of claim 19, wherein each test case data element E further comprises data representing a score output by the second model when test case C is provided as an input to the second model; wherein (A) comprises selecting the first model from a plurality of models; and wherein the method further comprises:
(J) selecting a second model from the plurality of models, wherein the second model is distinct from the first model; and performing (D) and (E) using the second model.
23. The method of claim 22, wherein the first model and the second model are of different types.
24. The method of claim 1, wherein the specification of the first performance metric defines the first performance metric as a parameterized mathematical formula that includes model score and dependent variable value as parameters.
25. The method of claim 24, wherein the specification of the first performance metric defines the first performance metric as a parameterized algebraic expression that includes model score and dependent variable value as parameters.
26. The method of claim 24, wherein (B) comprises:
(B)(1) receiving first performance metric input from the user; and
(B)(2) generating the specification of the first performance metric based on the first performance metric input.
27. The method of claim 1, further comprising:
(J) identifying a specification of a second performance metric having a second parameter having a third value, wherein the second parameter comprises a decision threshold;
(K) computing, based on each of the plurality of test case data elements, a corresponding second performance metric value that represents performance of the first model relative to the plurality of dependent variable values, using the second performance metric, thereby computing a plurality of second performance metric values corresponding to the plurality of test case data elements; and
(L) displaying visual output representing the plurality of second performance metric values.
28. The method of claim 27, further comprising fixing the first performance metric’s decision threshold to the second performance metric’s decision threshold, the fixing comprising:
(M) receiving user input specifying a change to a selected one of the first performance metric’s decision threshold and the second performance metric’s decision threshold;
(N) changing the selected decision threshold based on the user input received in (M); and
(O) changing the other decision threshold to be equal to the selected decision threshold.
29. The method of claim 1, wherein (D) comprises:
(D)( 1) associating the test set data with a plurality of bins, wherein each of the plurality of bins is associated with a distinct subset of the plurality of test case data elements;
(D)(2) for each of the plurality of bins:
(D)(2)(a) computing a corresponding plurality of intermediate values of the first performance metric for that bin; and
(D)(2)(b) computing the corresponding first performance metric value for that bin based on the bin’s plurality of intermediate values of the first performance metric.
30. The method of claim 29, wherein (H) comprises:
(H)(1) for each of the plurality of bins, computing the corresponding updated first performance metric value for that bin based on the bin’s plurality of intermediate values of the first performance metric.
31. A computer-implemented system for evaluating the performance of a first model, the system comprising at least one non-transitory computer-readable medium having computer program instructions stored thereon, the computer program instructions being executable by at least one computer processor to perform a method, the method comprising:
(A) identifying the first model;
(B) identifying a specification of a first performance metric having a first parameter having a first value, wherein the first parameter comprises a decision threshold;
(C) receiving test set data associated with a test set, the test set data comprising a plurality of test case data elements corresponding to a plurality of test cases in the test set, the plurality of test case data elements including a plurality of score outputs and a plurality of dependent variable values, each test case data element E in the plurality of test case data elements corresponding to a corresponding test case C, each test case data element E comprising: (1) data representing a score output by the first model when test case C is provided as an input to the first model; and (2) data representing a value of a dependent variable for test case C;
(D) computing, based on each of the plurality of test case data elements, a corresponding first performance metric value that represents performance of the first model relative to the plurality of dependent variable values, using the first performance metric, thereby computing a plurality of first performance metric values corresponding to the plurality of test case data elements;
(E) displaying visual output representing the plurality of first performance metric values;
(F) receiving, from a user, a first parameter adjustment input;
(G) updating the first parameter of the first performance metric to a second value of the first parameter of the first performance metric, based on the first parameter adjustment input, wherein the second value of the first parameter of the first performance metric differs from the first value of the first parameter of the first performance metric;
(H) computing, based on each of the plurality of test case data elements, a corresponding updated first performance metric value using the second value of the first parameter of the first performance metric, thereby computing a plurality of updated first performance metric values corresponding to the plurality of test case data elements; and
(I) displaying visual output representing the plurality of updated first performance metric values.
32. The system of claim 31, wherein the first model comprises a machine learning model.
33. The system of claim 31, wherein (F) comprises receiving the input from the user via a graphical user interface element.
34. The system of claim 33, wherein the graphical user interface element comprises a slider.
35. The system of claim 31, wherein a second parameter of the first performance metric is selected from the group consisting of a confidence threshold, a parameter of the confidence threshold, a false positive cost, a parameter of the false positive cost, a false negative cost, a parameter of the false negative cost, a true positive gain, and a parameter of the true positive gain.
36. The system of claim 31, wherein (C) further comprises generating the test set data, wherein generating the test set data comprises, for each test case data element E, corresponding to a corresponding test case C, in the plurality of test case data elements: providing the test case C as an input to the first model; and using the first model to output a score based on the test case C.
37. The system of claim 31, wherein the first model comprises a first machine learning model that is of a type selected from the group consisting of a neural network, a large language model, a decision tree, an ensemble model, and a logistic regression model.
38. The system of claim 31, wherein the first performance metric comprises a first technical metric.
39. The system of claim 38, wherein the first technical metric is of a type selected from the group consisting of f-score, precision, recall, sensitivity, specificity, lift, gains, false positive rate, false negative rate, true positive rate, true negative rate, false positive count, false negative count, true positive count, true negative count, accuracy, and log loss.
40. The system of claim 31, wherein the first performance metric comprises a first business metric.
41. The system of claim 40, wherein the first business metric is of a type selected from the group consisting of profit, revenue gain, revenue, savings, costs, loss, misclassification costs, customer acquisitions, customer attrition, employee acquisitions, employee attrition, labor reduction, loss ratio, marketing response rate, return on investment, debtor default rate, customer retention rate, amount of fraud prevented, amount of crime prevented, lives saved, deaths, and customer attrition rate.
42. The system of claim 31, wherein (D) comprises:
(D)(1) sorting the plurality of test case data elements by their score outputs to produce a sorted plurality of test case data elements;
(D)(2) for each test case data element E in the sorted plurality of test case data elements, corresponding to a corresponding test case C, in sorted order: computing the first performance metric value corresponding to test case data element E based on the value of the dependent variable for test case C.
43. The system of claim 42, wherein computing the first performance metric value corresponding to test case data element E based on the value of the dependent variable for test case C comprises:
processing test case data elements preceding test case data element E in the sorted plurality of test case data elements using a first operation; and processing test case data elements following and including test case data element E in the sorted plurality of test case data elements using a second operation that differs from the first operation.
44. The system of claim 42, wherein (D) further comprises accumulating the computed first performance metric values as they are being computed.
45. The system of claim 42, wherein (D)(1) comprises sorting the plurality of test case data elements by their score outputs in ascending order.
46. The system of claim 42, wherein (D)(1) comprises sorting the plurality of test case data elements by their score outputs in descending order.
47. The system of claim 31, wherein each of the plurality of test case data elements further includes a corresponding value of an additional independent variable, and wherein the method further comprises: filtering the test set data to only include test case data elements corresponding to test cases that satisfy a condition related to the additional independent variable; and wherein (D) comprises computing the plurality of first performance metric values corresponding only to the filtered test set data.
48. The system of claim 31, wherein the specification of the first performance metric further specifies an additional variable, wherein each of the plurality of test case data elements further includes a corresponding value of the additional variable, and wherein (D) comprises: computing, for each test case data element E in the plurality of test case data elements, the corresponding first performance metric value based on the specification of the first performance metric and test case data element E's score output, dependent variable value, and additional variable value.
49. The system of claim 31, wherein each test case data element E further comprises: (3) data representing a score output by a second model when test case C is provided as an input to the second model; wherein (D) further comprises computing a plurality of second performance metric values corresponding to the scores output by the second model; and wherein (E) further comprises displaying visual output representing the scores output by the second model.
50. The system of claim 31, wherein (A) comprises selecting the first model from a plurality of models.
51. The system of claim 50, wherein the plurality of models includes models of at least two types selected from the group consisting of neural network, large language model, decision tree, ensemble model, and logistic regression model.
52. The system of claim 49, wherein each test case data element E further comprises data representing a score output by the second model when test case C is provided as an input to the second model; wherein (A) comprises selecting the first model from a plurality of models; and wherein the method further comprises:
(J) selecting a second model from the plurality of models, wherein the second model is distinct from the first model; and performing (D) and (E) using the second model.
53. The system of claim 52, wherein the first model and the second model are of different types.
54. The system of claim 31, wherein the specification of the first performance metric defines the first performance metric as a parameterized mathematical formula that includes model score and dependent variable value as parameters.
55. The system of claim 54, wherein the specification of the first performance metric defines the first performance metric as a parameterized algebraic expression that includes model score and dependent variable value as parameters.
56. The system of claim 54, wherein (B) comprises:
(B)(1) receiving first performance metric input from the user; and
(B)(2) generating the specification of the first performance metric based on the first performance metric input.
57. The system of claim 31, further comprising:
(J) identifying a specification of a second performance metric having a second parameter having a third value, wherein the second parameter comprises a decision threshold;
(K) computing, based on each of the plurality of test case data elements, a corresponding second performance metric value that represents performance of the first model relative to the plurality of dependent variable values, using the second performance metric, thereby computing a plurality of second performance metric values corresponding to the plurality of test case data elements; and
(L) displaying visual output representing the plurality of second performance metric values.
58. The system of claim 57, further comprising fixing the first performance metric’s decision threshold to the second performance metric’s decision threshold, the fixing comprising:
(M) receiving user input specifying a change to a selected one of the first performance metric’s decision threshold and the second performance metric’s decision threshold;
(N) changing the selected decision threshold based on the user input received in (M); and
(O) changing the other decision threshold to be equal to the selected decision threshold.
59. The system of claim 31, wherein (D) comprises:
(D)(1) associating the test set data with a plurality of bins, wherein each of the plurality of bins is associated with a distinct subset of the plurality of test case data elements;
(D)(2) for each of the plurality of bins:
(D)(2)(a) computing a corresponding plurality of intermediate values of the first performance metric for that bin; and
(D)(2)(b) computing the corresponding first performance metric value for that bin based on the bin’s plurality of intermediate values of the first performance metric.
60. The system of claim 59, wherein (H) comprises:
(H)(1) for each of the plurality of bins, computing the corresponding updated first performance metric value for that bin based on the bin’s plurality of intermediate values of the first performance metric.
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363589967P | 2023-10-12 | 2023-10-12 | |
| US202363589974P | 2023-10-12 | 2023-10-12 | |
| US202363589979P | 2023-10-12 | 2023-10-12 | |
| US63/589,967 | 2023-10-12 | ||
| US63/589,979 | 2023-10-12 | ||
| US63/589,974 | 2023-10-12 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025080778A1 true WO2025080778A1 (en) | 2025-04-17 |
Family
ID=95396355
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/050685 Pending WO2025080778A1 (en) | 2023-10-12 | 2024-10-10 | Dynamic visualization for the performance evaluation of machine learning models and any other type of predictive model |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025080778A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2019164981A1 (en) * | 2018-02-20 | 2019-08-29 | Pearson Education, Inc. | Systems and methods for automated evaluation model customization |
| US10719301B1 (en) * | 2018-10-26 | 2020-07-21 | Amazon Technologies, Inc. | Development environment for machine learning media models |
| US11537506B1 (en) * | 2018-10-26 | 2022-12-27 | Amazon Technologies, Inc. | System for visually diagnosing machine learning models |
| US20230036289A1 (en) * | 2021-07-28 | 2023-02-02 | Bank Of America Corporation | Software model testing for machine learning models |
| US20230195845A1 (en) * | 2018-10-26 | 2023-06-22 | Amazon Technologies, Inc. | Fast annotation of samples for machine learning model development |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24877974; Country of ref document: EP; Kind code of ref document: A1 |