WO2025196089A1

WO2025196089A1 - Processing of inclusion and exclusion criteria for research studies

Info

Publication number: WO2025196089A1
Application number: PCT/EP2025/057437
Authority: WO
Inventors: Rolf Christian JONASSON; Arnoldo Frigessi DI RATTALMA
Original assignee: Nordicrwe AS
Current assignee: Nordicrwe AS
Priority date: 2024-03-19
Filing date: 2025-03-18
Publication date: 2025-09-25
Anticipated expiration: 2026-09-19
Also published as: GB202403906D0

Abstract

A computer-implemented method of processing data encoding inclusion and exclusion criteria (l&EC) for selection of subjects for a research study is disclosed. The method comprises receiving first and second criteria data encoding respective sets of l&EC for selection of subjects for a research study. The first and second sets of l&EC comprise, respectively, a first plurality and a second plurality of constraints each relating to a respective attribute category of a plurality of attribute categories for subjects for inclusion in, or exclusion from, the research study. The method comprises receiving a first and a second set of attribute data comprising, for each of one or more individuals that comply, respectively, with the first or second set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories. The method further comprises processing the attributes of the first and second sets of attribute data to determine a distance between the first and second sets of attribute data, and generating a distance value indicative of the determined distance, and storing the distance value in a memory for further processing.

Description

Processing of Inclusion and Exclusion Criteria for Research Studies

TECHNICAL FIELD

The present invention relates to computer-implement methods of processing data encoding inclusion and exclusion criteria (l&EC) for selection of subjects for a research study.

BACKGROUND

During the process of selecting suitable subjects for a research study, such as a clinical trial (e.g. a pharmaceutical trial for a new drug), inclusion and exclusion criteria (l&EC) are applied that impose constraints on various attributes of individuals to be included in or excluded from the trial. This helps ensure that any results obtained from the trial are statistically valid, e.g. by ensuring that subjects of an active arm and subjects of a control arm of a clinical trial are comparable, such that any significant differences in treatment effect can be attributed to the drug or method used by the active arm, rather than arising from selection differences between the arms.

As such, establishing l&EC for a research study, such as a clinical trial, may need to be carefully considered, with various factors (e.g. sample size, target population, etc.) being taken into account. This is not straightforwardly done using existing manual processes.

The present invention aims to provide improved methods for analysing and determining l&EC for research studies.

SUMMARY OF THE INVENTION

When viewed from a first aspect, the invention provides a computer-implemented method of processing data encoding inclusion and exclusion criteria (l&EC) for selection of subjects for a research study, comprising: receiving first criteria data encoding a first set of l&EC for selection of subjects for a research study, the first set of l&EC comprising a first plurality of constraints each relating to a respective attribute category of a plurality of attribute categories for subjects for inclusion in, or exclusion from, the research study; receiving a first set of attribute data comprising, for each of one or more individuals that comply with the first set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories; receiving second criteria data encoding a second set of l&EC for selection of subjects for a research study, the second set of l&EC comprising a second plurality of constraints each relating to a respective attribute category of the plurality of attribute categories; receiving a second set of attribute data comprising, for each of one or more individuals that comply with the second set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories; processing the attributes of the first and second sets of attribute data to determine a distance between the first and second sets of attribute data; and generating a distance value indicative of the determined distance and storing the distance value in a memory for further processing.

When viewed from a second aspect, the invention provides a computer system configured to process data encoding inclusion and exclusion criteria (l&EC) for selection of subjects for a research study according to any method disclosed herein.

When viewed from a third aspect, the invention provides computer software, and optionally a non- transitory computer-readable medium storing the same, comprising instructions which, when executed on a computer system, cause the computer system to process data encoding inclusion and exclusion criteria (l&EC) for selection of subjects for a research study according to any method disclosed herein.

Thus it will be seen that embodiments of the invention perform a computer-implemented comparison of different sets of l&EC in order to generate a quantitative measure of the distance between those sets (i.e. how similar or dissimilar the two sets are). Methods disclosed herein may advantageously allow a comparison to be performed even where l&ECs impose constraints of various different types (e.g. binary variables, discrete variables, continuous variables, etc.). Certain embodiments can use this distance measure — optionally in combination with sample size and/or a confidence interval length — to select an optimised set of l&EC for a research study, as described in more detail below. This may be done before the research study is performed. However, other embodiments may use this distance measure to analyse or interpret a research study after it has been performed (i.e. after the criteria for selection of subjects have already been used to select subjects).

The research study may be a randomized controlled trial (RCT). It may be a study of an intervention or treatment. In some embodiments it is a clinical trial.

Methods disclosed herein may, for example, be helpful in various aspects of drug development.

One such aspect is where a drug has been successfully tested in a phase two study against standard treatment, using a particular set of l&EC, and the developer wishes to design the next trial so that the drug can be approved for the broadest possible patient group for which the new drug is better than standard treatment, or better than any other treatment. The developer therefore wishes to define a less restrictive l&EC, relaxing or changing the original l&EC used in the phase two study, which however also confirms the successful comparison of the phase two study. Methods disclosed herein can provide a useful approach for assisting in this process by providing a quantitative measure of the distance between the original l&EC used in the phase two study with prospective l&EC for the third phase. The same is true for earlier stages of drug development, e.g. when a company is entering a completely new therapeutic area in order to design the clinical development program when starting clinical trials.

Another such aspect is where only an active arm for a clinical trial (e.g. a single-arm trial (SAT)) has been performed, and there is a need to compare the efficacy of the drug against standard treatment, e.g. an external (i.e. synthetic) control arm (ECA). Ideally, the l&EC for an external control arm should ideally be as similar to (i.e. as small a distance from) the l&EC used in the active arm as possible in order to ensure validity of the statistical comparison. However, it may not be possible to use an identical l&EC for the external control arm - e.g. where that used for the active arm includes a specific constraint which is not routinely tested in a standard clinical setting. Methods disclosed herein advantageously allow l&EC to be quantitatively compared when constructing an external control arm from registry data in order to perform a statistical comparison.

A further aspect is for interpreting clinical trial results once a randomized controlled trial (RCT) result is available. The quantitative distance measure may aid in contextualizing results from RCTs by facilitating the progressive adjusting of l&ECs for examining how modifications to l&ECs impact the population size and efficacy estimates. This may help support pharmaceutical companies and regulatory bodies in assessing the representativeness of clinical trial populations in comparison to populations in clinical practice — e.g. for supporting Health Technology Assessment (HTA) decisionmaking, often referred to as the efficacy-effectiveness gap.

In some embodiments, the first and second sets of attribute data each encode a respective array or list storing attribute categories along a first dimension and individuals along a second dimension.

The research study may be a trial of a drug (i.e. pharmaceutical) or other medical therapy. It may be a trial for human therapy or for veterinary therapy. The individuals may be humans or animals. Each individual may be an individual person or animal. However, in some examples, the same individual may be in both the first and second sets of attribute data.

The attribute categories may include a quantitative category and/or a categorical category and/or a qualitative category. They may include binary and/or multinomial categories. They may, for example, include any one or more of: age, height, weight, having or having had a particular disease, having had a particular medical procedure, taking or having taken a particular medicine, having or having had a result of a particular bio-medical measurement, etc.

Each attribute category may be associated with a set of attribute values (e.g. {true, false}, or the set of positive integers). An attribute for an attribute category may be a value selected from this set of attribute values (e.g. “true”, or “64”). A constraint relating to an attribute category may define a subset of attribute values for inclusion or exclusion (e.g. “true”, or greater than 18). The constraint may be encoded as a set of values (e.g. as a list), or as an upper and/or lower threshold.

The first set of l&EC may have a same number of constraints as the second set of l&EC. The first set of l&EC and the second set of l&EC may each comprise constraints relating to each of the plurality of attribute categories. However, the first set of l&EC and/or the second set of l&EC may also include one or more further constraints relating to additional attribute categories not included in the plurality of attribute categories.

In some embodiments, the step of processing the attributes of the first and second sets of attribute data to determine a distance between the first and second sets of attribute data comprises: performing a normalisation operation on each of the attributes of first set of attribute data in order to obtain a first set of normalised attribute data; performing a normalisation operation on each of the attributes of second set of attribute data in order to obtain a second set of normalised attribute data; and processing the normalised attributes of the first and second sets of normalised attribute data in order to determine the distance between the first and second sets of attribute data.

Such embodiments may advantageously allow attributes of different types (e.g. binary, multinomial, continuous) to be put onto the same scale, and thus allow each of these to be taken into account when comparing different sets of l&EC. The normalisation operation may comprise mapping each attribute of the set of attribute data to a value on a common scale (e.g. zero to one). For at least one or more, or all, of the attributes, the normalisation operation may be a linear normalisation operation. For at least one or more, or all, of the attributes, the normalisation operation may comprise mapping an attribute in the set of attribute data to a value on the common scale (e.g. zero to one), wherein a first or minimum value of the attribute in the set of attribute data for the attribute category is mapped to a smallest value on the scale, and a second or maximum value of the attribute in the set of attribute data for the attribute category is mapped to a largest value on the scale. For quantified attribute categories, values between the minimum value and the maximum value may be mapped on a linear scale between the smallest and largest values on the scale.

In some embodiments, the method further comprises: performing a dimensionality reduction operation on the first set of normalised attribute data in order to obtain a first dimensionally reduced set of normalised attribute data; and performing a dimensionality reduction operation on the second set of normalised attribute data in order to obtain a second dimensionally reduced set of normalised attribute data.

Such embodiments may advantageously improve processing efficiency by reducing the size of the data being analysed while retaining sufficient information for an accurate comparison between l&EC. This is particularly important in situations where large amounts of data require processing, which is the case in many embodiments where large volumes of patient data are used. It may also be useful in situations where the attributes are measured with measurement error, which may be reduced by such dimension reduction. In some embodiments, performing the dimensionality reduction operation comprises performing a principle component analysis.

In some embodiments, the method further comprises: computing a first number, k1, of principle components of the first dimensionally reduced set of normalised attribute data having a variability metric that is greater than or equal to a predetermined proportion of a variability metric calculated for the first dimensionally reduced set of normalised attribute data; computing a second number, k2, of principle components of the second dimensionally reduced set of normalised attribute data having a variability metric that is greater than or equal to a predetermined proportion of a variance metric calculated for the second dimensionally reduced set of normalised attribute data; determining a maximum value, kmax, of k1 and /c2; selecting a first subset of the first dimensionally reduced set of normalised attribute data, the first subset comprising kmax principle components of the first dimensionally reduced set of normalised attribute data; and selecting a second subset of kmax principle components of the second dimensionally reduced set of normalised attribute data, the second subset comprising kmax principle components of the second dimensionally reduced set of normalised attribute data.

Such embodiments may further advantageously improve processing efficiency by reducing the size of the data being operated on while retaining sufficient information for an accurate comparison. The same predetermined proportion may be used for both sets. In some embodiments, the or each predetermined proportion is 75%, though it may be any appropriate value e.g. 50%, 80%, 90%, or 100% in other embodiments. The variability metric may be a variance, a standard deviation, or any other appropriate statistical measure of the variability of the respective set of data.

In some embodiments, the method further comprises: determining a first centroid of the first subset (i.e. in the reduced max-dimensional space); determining a second centroid of the second subset (i.e. in the reduced max-dimensional space); and computing a Euclidean distance between the first and second centroids, wherein the Euclidean distance measures the distance between the first and second sets of attribute data.

The computed Euclidean distance, or a scaled or modified version thereof, may provide the distance value.

In some embodiments, the method further comprises: receiving a third set of attribute data comprising, for each of one or more research study subjects included in an active arm of a research study, a respective attribute for each attribute category of the plurality of attribute categories; performing a propensity score matching operation between the third set of attribute data and the first set of attribute data in order to obtain a first subset of the first set of attribute data; and performing a propensity score matching operation between the third set of attribute data and the second set of attribute data in order to obtain a second subset of the second set of attribute data.

The active arm may be an active arm of a research study (e.g. a clinical trial) performed using research study subjects satisfying the first set of l&EC.

The first and second subsets may each include a number of individuals equal to a number of individuals included in the third set of attribute data. The first and second sets of attribute data may be received from one or more databases of patient data, e.g. from a computer memory or from a server over a computer network. The one or more databases may include electronic health records (EHRs), registries, etc. The first and second sets of attribute data may comprise attribute data for subjects of a prospective external control arm, obtained from one or more databases of patient data, for the active arm of the research study from which the third set of attribute data is obtained.

In some embodiments, the method further comprises: computing an intersection between the first subset and the second subset; computing a cardinality of the intersection in order to obtain a cardinality value indicative of the cardinality; and determining the distance between the first and second sets of attribute data from the cardinality value.

In some embodiments, determining the distance between the first and second sets of attribute data comprises: dividing the cardinality value by a divisor derived from a number of individuals included in the third set of attribute data; and subtracting the result from a predetermined number.

The result of the subtraction may be the distance. The divisor may be proportional to a number of individuals included in the third set of attribute data. The divisor may be equal to the number of individuals included in the third set of attribute data. The predetermined number may be equal to one.

In some embodiments, the first and second sets of attribute data are received from one or more databases of patient data. The first and second sets of attribute data may be received by accessing the database(s), e.g. from a computer memory or from a server over a computer network. In some embodiments, the method comprises: receiving a first set of research study data for an active arm of a first research study performed using research study subjects satisfying the first set of l&EC; accessing one or more databases of patient data in order to obtain a second set of research study data, for a control arm of a research study, comprising individuals satisfying the second set of l&EC; computing a sample size of the second set of research study data, and storing a sample size value indicative of the sample size in the memory; and computing an overall quality rating for the second set of l&EC in dependence upon at least the stored distance value and the stored sample size value, and storing the overall quality rating in the memory.

In some embodiments, the method comprises: receiving a first set of research study data for an active arm of a first research study performed using research study subjects satisfying the first set of l&EC; accessing one or more databases of patient data in order to obtain a second set of research study data, for a control arm of a research study, comprising individuals satisfying the second set of l&EC; computing an estimate of a length of a confidence interval of a treatment effect of the first research study based on the first set of research study data and the second set of research study data, and storing a confidence-interval length estimate value indicative of the estimated confidence interval length in the memory; and computing an overall quality rating for the second set of l&EC in dependence upon at least the stored distance value and the stored confidence-interval length estimate value, and storing the overall quality rating in the memory.

In some embodiments, the method comprises: receiving a first set of research study data for an active arm of a first research study performed using research study subjects satisfying the first set of l&EC; accessing one or more databases of patient data in order to obtain a second set of research study data, for a control arm of a research study, comprising individuals satisfying the second set of l&EC; computing a sample size of the second set of research study data, and storing a sample size value indicative of the sample size in the memory; computing an estimate of a length of a confidence interval of a treatment effect of the first research study based on the first set of research study data and the second set of research study data, and storing a confidence-interval length estimate value indicative of the estimated confidence interval length in the memory; and computing an overall quality rating for the second set of l&EC in dependence upon the stored distance value, the stored sample size value and the stored confidence-interval length estimate value, and storing the overall quality rating in the memory.

Such embodiments may advantageously allow a quantitative measure of quality (the overall quality rating) for a second set of l&EC, relative to a first set of l&EC (e.g. one used for an earlier stage of a trial, or in an active arm of a clinical study for which an external control arm is being designed), while taking into account additional factors which may affect this, in particular sample size and treatment-effect confidence interval length.

The confidence-interval length estimate may comprise an estimate of a length of a 95% or 99% (or any other percentage value) confidence interval of treatment effect.

In some embodiments, computing the overall quality rating for the second set of l&EC comprises combining the stored distance value, the stored sample size value and the stored confidenceinterval length estimate value. The combining may be linear. Computing the overall quality rating may further comprising combining one or more additional metrics.

In some embodiments, the method comprises receiving one or more weighting values, each weighting value being associated with the distance value, the sample size value, or the confidenceinterval length estimate value; and wherein computing the overall quality rating for the second set of l&EC comprises: weighting the stored distance value, the stored sample size value and the stored confidence-interval length estimate value according to the weighting value(s) associated therewith; and combining (e.g. linearly) the weighted distance value, the weighted sample size value and the weighted confidence-interval length estimate value.

Such embodiments may advantageously allow a level of flexibility and customisation to be provided to a user, since the use of weighting values may allow subjective considerations of importance for different factors, when comparing sets of l&EC, to be taken into account while computing the overall quality rating.

Combining a plurality of values (e.g. the stored distance value, the stored sample size value and the stored confidence-interval length estimate value) may comprise adding and/or subtracting each of the values. The values for which a smaller value is more desirable (e.g. distance and confidenceinterval length) may be subtracted, and the values for which a larger value is more desirable (e.g. sample size) may be added.

The weighting value(s) may be received from an input device e.g. a touchscreen, keyboard, mouse, etc. In some embodiments the method comprises receiving a respective weighting value associated with each of the distance value, the sample size value, and the confidence-interval length estimate. The method may comprise presenting a user with a user interface, displayed on a display, which enables the user to input or select the one or more weighting values. The user interface may for example display one or more sliders for selecting weighting values, or one or more input fields where a user may input data.

In some embodiments, the method comprises: storing the second set of l&EC in the memory as a current second set of l&EC; and in each of one or more iterations: modifying a constraint included in the current second set of l&EC in order to obtain a modified second set of l&EC, and storing the modified second set of l&EC as the current second set of l&EC in the memory; receiving a revised second set of attribute data comprising, for each of one or more individuals that comply with the current second set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories; accessing the one or more databases of patient data in order to obtain a revised second set of research study data, for a control arm of a research study, comprising individuals satisfying the current second set of l&EC; processing the attributes of the first set of attribute data and the revised second set of attribute data to determine a distance between the first set of attribute data and the revised second set of attribute data, and storing a current distance value indicative of the determined distance in the memory; computing a sample size of the revised second set of research study data, and storing a current sample size value indicative of the sample size in the memory; computing an estimate of a length of a confidence interval of a treatment effect of the first research study based on the first set of research study data and the revised second set of research study data, and storing a new confidence-interval length estimate value indicative of the estimated confidence interval length in the memory; and computing a current overall quality rating for the current second set of l&EC based on the stored current distance value, the stored current sample size value and the stored current confidence-interval length estimate value, and storing the current overall quality rating in the memory.

Such embodiments may advantageously allow systematic exploration of possible sets of l&EC, and comparison therebetween, thus providing a useful tool for l&EC selection. The method may comprise an optimisation process. Some embodiments may iteratively modify constraints to determine an optimised overall quality rating. This optimised overall quality rating may be used to determine an optimised l&EC for selection of subjects for a research study (i.e. being the current second set of l&EC in a final iteration of the one or more iterations). Modifying the constraint in each iteration may comprise relaxing the constraint by a single discretised unit thereof, e.g. one year of age, a single cancer stage, etc. Modifying the constraint in each iteration may comprise selecting a constraint in the current second set of l&EC at random, and modifying the selected constraint.

In some embodiments, modifying the constraint in each iteration comprises: for each constraint included in the current second set of l&EC: modifying the constraint by a single discretised unit thereof in order to obtain a temporary modified second set of l&EC; computing and storing in the memory a temporary overall quality rating for the temporary modified second set of l&EC; and selecting one of the temporary modified second sets of l&EC to use as the modified second set of l&EC in dependence on the temporary stored overall quality ratings.

Such embodiments may advantageously improve processing efficiency by causing the systematic exploration of possible sets of l&EC to move towards l&EC having better (e.g. higher) overall quality ratings (i.e. optimising on quality rating), and avoid unnecessary computation for l&EC sets which have worse overall quality ratings.

In some embodiments, selecting one of the temporary modified second sets of l&EC comprises selecting the temporary modified second set of l&EC having a best (e.g. highest) overall quality rating. The temporary overall quality ratings may be all stored in memory and compared directly, or a temporary quality rating currently stored in the memory may be updated in the event that the new temporary quality rating is better (e.g. higher) than the temporary quality rating currently stored in memory.

In some embodiments, selecting one of the temporary modified second sets of l&EC comprises selecting one of the temporary modified second sets of l&EC according to a calculated probability distribution, wherein the probability of a temporary modified second set of l&EC being selected is dependent on the overall quality value associated therewith. Such embodiments may advantageously ensure that the systematic exploration of possible sets of l&EC tends towards those having better overall quality ratings, but minimises the possibility of finding a local minimum by ensuring that more sets of l&EC are explored.

In such embodiments, the temporary modified second set of l&EC having a best (e.g. highest) overall quality rating may have the highest probability of being selected. Similarly, the temporary modified set of l&EC with the worst (e.g. lowest) overall quality rating may have the lowest probability of being selected.

Each modified second set of l&EC, and the overall quality rating associated therewith, may be stored in the memory for later access. The distance value, sample size value and confidence- interval length estimate value associated with each modified second set of l&EC may also be stored in the memory for later access.

The method may comprise stopping the one or more iterations once a stop condition is met. This may be after a predetermined maximum number of iterations and/or when the overall quality rating has been optimised (e.g. is maximised or locally maximised, or the rate of improvement has slowed below a threshold) and/or when all possible modifications have been made.

In some embodiments, the method comprises: receiving a first set of research study data for an active arm of a first research study performed using research study subjects satisfying the first set of l&EC; receiving data encoding one or more further second sets of l&EC for respective research studies so as to receive data encoding a plurality of second sets of l&EC, each of the plurality of second sets of l&EC comprising a respective second plurality of constraints each relating to a respective attribute category of the plurality of attribute categories; for each of the plurality of second sets of l&EC: receiving a respective second set of attribute data comprising, for each of one or more individuals that comply with the second set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories; accessing one or more databases of patient data to obtain a respective second set of research study data, for a control arm of a research study, comprising individuals which satisfying the second set of l&EC; and processing the attributes of the first set of attribute data and the second set of attribute data to determine a distance between the first and second sets of attribute data, and storing a respective distance value indicative of the determined distance in the memory; computing a sample size of the second set of research study data, and storing a respective sample size value indicative of the sample size in the memory; and computing an estimate of a length of a confidence interval of a treatment effect of the first research study based on the first set of research study data and the second set of research study data, and storing a respective confidence-interval length estimate value indicative of the estimated confidence interval length in the memory; and computing an overall quality rating for each second set of l&EC based on the respective stored distance value, the respective stored sample size value and the respective stored confidence-interval length estimate value associated therewith, and storing the overall quality ratings in the memory.

In some embodiments, the method comprises: performing a normalisation operation on each stored distance value in order to obtain and store in the memory a respective normalised distance value; performing a normalisation operation on each stored sample size value in order to obtain and store in the memory a respective normalised sample size value; performing a normalisation operation on each stored confidence-interval length estimate value in order to obtain and store in the memory a respective normalised confidence-interval length estimate value; and computing the respective overall quality rating for each second set of l&EC based on the respective normalised distance value, the respective normalised sample size value and the respective normalised confidence-interval length estimate value associated therewith.

The normalisation operation may comprise mapping each stored value to a value on the same scale - e.g. zero to one. The normalisation operation may be a linear normalisation operation. The normalisation operation may comprise mapping the stored value to a value on a predetermined scale (e.g. zero to one), wherein the minimum value of the stored values of the same type is mapped to zero, the maximum value of the stored values of the same type is mapped to one, and values between the minimum value and the maximum value are mapped on a linear scale between the smallest and largest values on the scale. Here, the term “stored values of the same type” is used to refer to the stored distance values when the normalisation operation is performed on a distance value, to the stored sample size values when the normalisation operation is performed on a sample size value, and to the stored confidence-interval length estimate values when the normalisation operation is performed on a confidence-interval length estimate value.

In some embodiments, computing the respective overall quality rating for each second set of l&EC comprises combining (e.g. linearly) the respective normalised distance value, the respective normalised sample size value and the respective normalised confidence-interval length estimate value.

In some embodiments, the method comprises: receiving one or more weighting values, each weighting value being associated with the distance value, the sample size value, or the confidence-interval length estimate value; and wherein computing the overall quality rating for each second set of l&EC comprises: weighting the stored normalised distance value, the stored normalised sample size value and the stored normalised confidence-interval length estimate value according to the weighting value(s) associated therewith; and combining (e.g. linearly) the weighted normalised distance value, the weighted normalised sample size value and the weighted normalised confidence-interval length estimate value.

Such embodiments may advantageously allow a level of flexibility and customisation to be provided to a user, since the use of weighting values may allow subjective considerations of importance for different factors, when comparing sets of l&EC, to be taken into account while computing the overall quality rating. Combining a plurality of values (e.g. the weighted normalised distance value, the weighted normalised sample size value and the weighted normalised confidence-interval length estimate value) may comprise adding and/or subtracting each of the values. The values for which a smaller value is more desirable may be subtracted, and the values for which a larger value is more desirable may be added.

In some embodiments, the method comprises identifying an optimal set of one or more of the plurality of second sets of l&EC in dependence on the stored overall quality ratings. Such embodiments may advantageously enable the method to automatically determine one or more best sets of l&EC, and present these to a user.

Identifying the optimal set of one or more second sets of l&EC may comprise identifying some one or more of the plurality of second sets of l&EC having the best stored overall quality ratings. The best overall quality ratings may be the lowest overall quality ratings.

In some embodiments, the method comprises ranking the plurality of second sets of l&EC according to their respective stored overall quality ratings.

In some embodiments, the method comprises displaying, on a display, a list of one or more of the plurality of second sets of l&EC ranked according to their respective stored overall quality ratings. Such embodiments may advantageously provide a useful output for a user which makes it easy to compare and contrast different sets of l&EC while ensuring that the best sets of l&EC identified are present to a user first, thus improving overall ease of use.

The method may comprise displaying, for each of the second sets of l&EC of the list, the respective stored distance value, the respective stored sample size value, and the respective stored confidence-interval length estimate value associated therewith.

The displayed list may be for allowing a user to select one or more of the second sets of l&EC of the list for use in a subsequent stage of a research study following the research study performed using research study subjects satisfying the first set of l&EC. The displayed list may be further for constructing an external control arm for statistical comparison with the subsequent stage of the research study, the external control arm comprising the second set(s) of research study data associated with the selected one or more second sets of l&EC.

The displayed list may be for allowing a user to select one or more of the second sets of l&EC of the list for constructing an external control arm for statistical comparison with the first set of research study data, the external control arm comprising the second set(s) of research study data associated with the selected one or more second sets of l&EC.

While in some embodiments, the research study is a clinical trial, in other embodiments it may be a different type of research study — e.g. in any field of natural or social sciences. It may, for example, be an agricultural study, an economics study, a psychology study, a political-science study, etc. The treatment effect may be any outcome of interest for the research study.

It will be understood that a computer system as disclosed herein may comprise one or more processors and memory storing software for execution by the one or more processors for performing any of the operations disclosed herein. The computer system may be a single computer (e.g. a workstation, PC, laptop or server) or may be distributed.

Features of any aspect or embodiment described herein may, wherever appropriate, be applied to any other aspect or embodiment described herein. Where reference is made to different embodiments or sets of embodiments, it should be understood that these are not necessarily distinct but may overlap.

BRIEF DESCRIPTION OF THE DRAWINGS

Certain embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram of a system for processing l&EC in accordance with an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a computer-implemented method of processing l&EC in order to determine a distance (i.e. dissimilarity) between two sets of l&EC according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a computer-implemented method of processing l&EC in order to determine a distance (i.e. dissimilarity) between two sets of l&EC according to another embodiment of the present invention;

FIG. 4 is a flowchart illustrating a computer-implemented method of processing l&EC in order to evaluate and compare one or more different sets of l&EC;

FIG. 5 is a flowchart illustrating a computer-implemented method of processing l&EC in order to systematically explore possible sets of l&EC;

FIG. 6 is a graph of a model generating simulated data;

FIG. 7 is a chart of control size, confidence interval length & distance of relaxation for data in a first example simulation; FIG. 8 shows three 2D plots relating control size, confidence interval length & distance of relaxation, for the first example simulation; and

FIG. 9 is a 3D plot of control size, confidence interval length & distance of relaxation for the first example simulation.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows a system 1 for processing l&EC in accordance with some embodiments. The system 1 comprises a user device 2, a first server 14, a second server 14’ and a network 22. Although two servers 14, 14’ are shown in FIG. 1 , the system 1 may comprise any appropriate number of servers 14, 14’. The user device 2 and the servers 14, 14’ are each connected to the network 22 via a wired or wireless connection, allowing the user device 2 to communicate with the servers 14, 14’, and vice versa.

The user device 2 comprises a display device 4, an input device 5, a processor 6 (e.g. a central processing unit, a system-on-chip, a field-programmable gate array, etc.), a volatile memory 8 (e.g. random-access memory), a non-volatile memory 10 (e.g. a hard-disk drive, a solid-state drive, flash memory, etc.) and a system bus 12. The display device 4, input device 5, processor s, volatile memory 8 and non-volatile memory 10 are each connected to the system bus 12, allow transfer of data between one another. The input device 5 may be any appropriate input device for allowing user inputs to the user device 2, e.g. a keyboard, a mouse, a touchscreen (in which case the display device 4 and the input device 5 may be the same component), etc.

The first server 14 comprises a processor 16, a volatile memory 18 and a non-volatile memory 20. Similarly, the second server 14’ comprises a processor 16’, a volatile memory 18’ and a non-volatile memory 20’. The respective processors 16, 16’ are connected to and thus able to transfer data to/from the respective memories 18, 18’, 20, 20’. In this embodiment, the two servers 14, 14’ each act as a database storing real-world patient data in the non-volatile memories 20, 20’ thereof- e.g. as electronic health records (EHRs), patient registries, etc.

The user device 2 transmits data requests over the network 22 to one or more of the servers 14, 14’, in response to which the requested servers 14, 14’ transmit some or all of the patient data stored in the non-volatile memories 20, 20’ thereof to the user device 2 over the network 22, in dependence on one or more parameters included in the data requests.

FIG. 2 is a flowchart illustrating a method 30 of processing l&EC in order to determine a distance between two sets of l&EC. In this embodiment, the method 30 is carried out by the processor 6 of the user device 2 shown in FIG. 1 , executing software instructions stored on the NVM 10. Although the following methods will be described with reference to clinical trials, they may also be applied to other types of research study — e.g. for selecting subjects in a customer satisfaction study. At step 32, a first set of l&EC L0 is received. This is represented by a vector of constraints each relating to a respective attribute category of subjects for inclusion in, or exclusion from, e.g. a clinical trial - e.g. age, height, weight, having or having had a particular disease, having had a particular medical procedure, taking or having taken a particular medicine, having or having had a result of a particular bio-medical measurement, etc. For example, for the attribute category ‘age’, the applicable constraint in the first set of l&EC may specify ‘greater than 40’, ‘less than 60’, ‘between 20 and 30’, or any other appropriate quantitative constraint. The first set of l&EC L0 is represented by a list or array encoding each constraint in a numerical (e.g. binary or multinomial) form. The first set of l&EC L0 in this embodiment is received through manual input by a user using the input device 5 (e.g. by manually selecting attribute categories and inputting constraints relating thereto), though it may be received in any other appropriate manner.

At step 34, a first set of attribute data M0 is received. The first set of attribute data M0 includes attributes of one or more individuals which satisfy the first set of l&EC L0. In this embodiment, the first set of attribute data M0 is received from one or both of the servers 14, 14’, acting as databases of patient data, over the network 22. The first set of attribute data M0 is represented by a two- dimensional array or list storing attribute categories along a first dimension, and individuals along a second dimension. The user device 2 transmits data indicative of the first set of l&EC L0 to the server(s) 14, 14’ e.g. in a database query, in response to which the server(s) 14, 14’ determine which patient(s) for which data is stored therein meet the constraints specified by the first set of l&EC L0, and transmits data indicative of the first set of attribute data M0 to the user device 2.

At step 36, a second set of l&EC L1 is received. The second set of l&EC L1 is a vector of constraints, like the first set L0, though it may include more or fewer constraints relating to the same or different attribute categories. The second set of l&EC L1 may be received through manual input by a user using the input device 5, generated by the user device 2 and stored in memory, received from a data structure stored on the NVM 10, received over the network 22, or received in any other appropriate manner.

At step 38, a second set of attribute data M1 is received. The second set of attribute data M1 is represented in the same way as the attribute data M0, and received from the server(s) 14, 14’ in the same manner as described previously, but contains attributes of individuals which meet the second set of l&EC L1 rather than the first set L0.

At step 40, the first set of attribute data M0 is normalised to obtain a first normalised set of attribute data NO. Each attribute included in the first set of attribute data M0 is normalised by mapping each attribute to a value on a common scale, in this case zero to one. The highest value for a given attribute category in the first set of attribute data M0 is mapped to one, the smallest value for a given attribute category is mapped to zero, and the other values are mapped on a linear scale between the smallest and largest values on the scale. At step 42, the second set of attribute data M1 is normalised to obtain a second normalised set of attribute data N1 according to this same process.

At step 44, a principle component analysis is performed on the first normalised set of attribute data NO in order to obtain a first dimensionally reduced set of normalised attribute data PO having n principle components. At step 46, the first k1 principle components of PO are computed such that the selected k1 principle components have a variability metric, in this case a variance, which is greater than 75% of the same metric computed for all n principle components of PO.

At step 48, a principle component analysis is performed on the second normalised set of attribute data N1 in order to obtain a second dimensionally reduced set of normalised attribute data P1 having n principle components. At step 50, the first k2 principle components of P1 are computed such that the selected k2 principle components have a variability metric, in this case a variance, which is greater than 75% of the same metric computer for all n principle components of P1.

Other dimensionality reduction processes, e.g. matrix factorisation, singular value decomposition, etc., may be used at steps 44 and 48, instead of the principle component analysis used in this embodiment. Other predetermined proportions of the variability metric may be used at steps 46 and 50, dependent on use, e.g. 50%, 60%, 80%, 90%, etc, and all n components may be selected in some embodiments, i.e. k1 = n and/or k2 = n.

At step 52, the maximum value of k1 and k2 is determined, and set as kmax (e.g. if k1 < k2, kmax = k2). At step 54, a first subset SO of the first dimensionally reduced set of normalised attribute data P0 is selected, the first subset SO including the first kmax principle components of P0. Similarly, at step 56, a second subset S1 of the second dimensionally reduced set of normalised attribute data P1 is selected, the second subset S1 including the first kmax principle components of P1.

At step 58, a first centroid CO is computed for the first subset SO in the reduced max-dimensional space. Similarly, at step 60, a second centroid C1 is computed for the second subset S1 in the reduced max-dimensional space. At step 62, the Euclidian distance d between the two centroids CO, C1 is computed. This Euclidean distance d is then stored in the volatile memory 8 and/or the non-volatile 10 as a distance value DV, at step 64. This distance value D\/ is indicative, inversely, of the similarity between the first set of attribute data M0 and the second set of attribute data M1, and thus also indicative, inversely, of the similarity between the two sets of l&EC L0, L1.

FIG. 3 is a flowchart illustrating a method 70 of processing l&EC in order to determine a distance between two sets of l&EC according to another embodiment. In this embodiment, the method 70 is carried out by the processor 6 of the user device 2 shown in FIG. 1 , executing software instructions stored on the NVM 10. At step 72, a first set of l&EC LO is received. At step 74, a first set of attribute data MO is received which satisfies the first set of l&EC LO. At step 76, a second set of l&EC L1 is received. At step 78, a second set of attribute data M1 is received which satisfies the second set of l&EC L1. Steps 72 to 78 are performed in substantially the same manner as described previously with reference to steps 32 to 38 of FIG. 2.

At step 80, a third set of attribute data M2 is received for an active arm of a clinical trial including na subjects. The third set of attribute data M2 is represented by a two-dimensional array or list storing attribute categories along a first dimension, and individuals along a second dimension, like the first and second sets of attribute data MO, M1. Each of the individuals included in the third set of attribute data M2 were included in the active arm of a clinical trial which has already been conducted. In this embodiment, each individual included in the third set of attribute data M2 satisfies the first set of l&EC LO, which was used for the active arm of the clinical trial. The third set of attribute data M2 may be received via the input device 5, via the network 22, from the NVM 10 on which the third set of attribute data M2 is stored, or any other appropriate manner.

At step 82, a propensity score matching operation is performed between the third set of attribute data M2 and the first set of attribute data MO in order to obtain a first subset SO of the first set of attribute data MO, the first subset SO including attribute data for na individuals included in the first set of attribute data MO which most closely match the individuals in the third set of attribute data M2. At step 84, a propensity score matching operation is performed between the third set of attribute data M2 and the second set of attribute data M1 in much the same manner as in step 82, in order to obtain a second subset S1 of the second set of attribute data M1.

At step 86, the intersection SO D S1 between the first subset SO and the second subset S1 is computed. At step 88, the cardinality | SO D S1 | of the intersection SO D S1 is computed. At step 90, a distance value DV indicative of a distance between the first set of attribute data MO and the second set of attribute data M1 (and thus also indicative of distance between the first set of l&EC LO and the second set of l&EC N1) is computed according to the equation: DV = -f na) | SO’ D S1’ |. In other words, the distance value DV is computed as: one minus the calculated cardinality divided by the number of subjects included in the active arm of the clinical trial (i.e. the number of individuals included in the third set of attribute data M2). The distance value DV is then stored in the volatile memory 8 and/or the non-volatile memory 10.

FIG. 4 is a flowchart illustrating a method 100 of processing l&EC in order to evaluate and compare one or more different sets of l&EC. In this embodiment, the method 100 is carried out by the processor 6 of the user device 2 shown in FIG. 1 , executing software instructions stored on the NVM 10. At step 102, a first set of l&EC LO is received. At step 104, a first set of attribute data MO is received which satisfies the first set of l&EC LO. Steps 102 and 104 are performed in substantially the same manner as described previously with reference to steps 32 and 34 of FIG. 2.

At step 106, a first set of clinical trial data TO for an active arm of a clinical trial performed using subjects that satisfy the first set of l&EC LO is received. The first set of clinical trial data TO is represented by a two-dimensional array or list storing clinical trial data (e.g. biological measurements, symptom reduction, etc.) along a first dimension, and clinical trial subjects along a second dimension.

At optional step 108, one or more weighting values are received. The weighting values in this embodiment are manually input by a user using the input device 5, e.g. via a graphical user interface displayed on the display 4 and one or more input devices (touchscreen, keyboard, mouse, etc.), though these may be received in any appropriate manner. The user interface may for example display one or more sliders for selecting weighting values, or one or more input fields where a user may input data. In this embodiment, a respective weighting value is received for each of the following categories: sample size, confidence-interval length estimate and distance, as is described in more detail below, though in other embodiments a respective weighting value may be received for any one or more of these categories (or indeed for none of these categories when optional step 108 is not carried out).

At step 110, a set X including one or more second sets of l&EC L1i is received. Each set of l&EC L1i is a vector of constraints, as described previously, thus making the setXa two-dimensional array or list of sets of l&EC L1i where it includes a plurality of second sets of l&EC L1i, or a vector where it includes only one second set of l&EC L1i.

At step 112, the next second set of l&EC L1i is selected from the setX. In the first iteration of the method 100, this is the first set of l&EC L1i included in the setX. At step 114, a second set of attribute data M1i is received. The second set of attribute data M1i includes attributes of one or more individuals which satisfy the selected second set of l&EC L1i, and is received in the same manner as described previously with reference to step 34 of FIG. 2.

At step 116, a second set of clinical trial data TH for an external (i.e. synthetic) control arm (ECA) of a clinical trial is received which includes clinical trial data for one or more individuals satisfying the selected second set of l&EC L1i. The second set of clinical trial data T1i is received from one or more both of the servers 14, 14’, acting as databases of patient data, over the network 22. The second set of clinical trial data T1i is represented by a two-dimensional array or list storing clinical trial data along a first dimension, and individuals along a second dimension. The user device 2 transmits data indicative of the selected second set of l&EC L1i to the server(s) 14, 14’ e.g. in a database query, in response to which the server(s) 14, 14’ determine which patient(s) for which data is stored therein meet the constraints specified by the selected second set of l&EC L1i, and transmits relevant data for those individuals for constructing an external control arm for the active arm of the clinical trial which produced the first set of clinical trial data TO.

At step 118, a distance value DVi indicative of a distance between the first set of attribute data MO and the second set of attribute data M1i is calculated and stored in the volatile memory 8 or the non-volatile memory 10, associated with the selected second set of l&EC L1i. The distance value DVi is calculated at step 118 according to the method 30 shown in FIG. 2, or the method 70 shown in FIG. 3 (albeit using the already received sets of l&EC LO & L1i and attribute data MO, M1i, rather than receiving new versions of these at steps 32 to 38 and steps 72 to 78).

At step 120, a sample size SS/ of the second set of clinical trial data TH is computed and stored in the volatile memory 8 or the non-volatile memory 10, associated with the selected second set of l&EC L1i. Calculation of the sample size SS/ at step 120 includes determining the number of individuals included in the second set of clinical trial data T1i.

At step 122, the length of a confidence interval of a treatment effect from the first set of clinical trial data TO and the second set of clinical trial data T1i is estimated, and the resultant confidenceinterval length estimate ILi is stored in the volatile memory 8 or the non-volatile memory 10, associated with the selected second set of l&EC L1i. In this embodiment, this involves estimating the length of a confidence interval for, e.g., drug efficacy, using the first set of clinical trial data TO as the active arm, and the second set of clinical trial data T1i as the control arm. It may be any other appropriate confidence-interval length estimate in other embodiments, depending on the nature of the first clinical trial data TO. It may include the use of inverse propensity scores.

At step 124, it is determined whether there are any second sets of l&EC L1i remaining in the setX which have not yet been selected at step 112. If any second sets of l&EC L1i remain in the setX, then the method 100 proceeds back to step 112, where the next second set of l&EC L1i in the setX is selected and steps 114 to 124 are repeated in respect of the newly selected set L1i. If no second sets of l&EC L1i remain in the setX, then the method 100 proceeds to step 126. This will occur in the first iteration of the method 100 where the setX includes only one second set of l&EC L1i.

At step 126, the stored distance values DV, sample size values SS and confidence-interval length estimates IL are normalised (provided that more than one of each value is stored in the memory 8, 10). This involves mapping each of these stored values to a value on a common scale, in this case zero to one. The distance values DV are mapped such that the highest stored distance value DVi is mapped to one, the smallest stored distance value DVi is mapped to zero, and the other stored distance values are mapped on a linear scale between the smallest and largest values on the scale. The same mapping process is performed on the stored sample size values SS and the stored confidence-interval length estimates IL respectively. At step 128, for each of the second sets of l&EC L1i included in the setX, the associated stored normalised distance value DVi, sample size SS/ and confidence-interval length estimate ILi are linearly combined in order to obtain an overall quality rating Qi for the second set of l&EC L1i. This includes adding or subtracting each of the values, where values for which a smaller value is more desirable (e.g. the distance value DVi and the confidence-interval length estimate iLi) are subtracted, and values for which a larger value is more desirable (e.g. the sample size SS/) are added. Thus, a higher overall quality rating Qi indicates a higher overall quality for the associated second set of l&EC L1i. Alternatively, and equivalently, an aggregated distance could instead be calculated by adding values that are desired to be large and subtracting values that are desired to be small.

Optionally, where weighting values are received at optional step 108, this linear combining of the values at step 128 includes multiplying any values for which n associated weighting value was received by that weighting value before performing the linear combining operation.

At step 130, one or more optimal second sets of l&EC L1i are identified by comparing the overall quality ratings Qi associated with each, e.g. by identifying one or more of the sets L1i having the smallest overall quality ratings Qi. At step 132, the second sets of l&EC L1i are ranked according to their overall quality ratings Qi, and displayed to a user using the display 4 in an ordered list. Steps 130 and 132 are optional, so neither step may be carried out in some embodiments, both steps may be carried out in other embodiments, and in other embodiments only one of the two steps 130, 132 may be carried out. These steps advantageously provide mechanisms for presenting sets of l&EC L1i to a user in a useable manner, so that they can make an informed decision for selecting a set of l&EC L1i for an external control arm for the active arm of a clinical trial which provided the first clinical trial data TO, or for selecting a set of l&EC L1i for use in a subsequent stage of a clinical trial to the stage which provided the first clinical trial data TO.

FIG. 5 is a flowchart illustrating a method 140 of processing l&EC in order to systematically explore and optimise sets of possible l&EC. In this embodiment, the method 140 is carried out by the processor s of the user device 2 shown in FIG. 1 , executing software instructions stored on the NVM 10.

At step 142, a first set of l&EC L0 is received. At step 144, a first set of attribute data M0 is received which satisfies the first set of l&EC L0. At step 146, a first set of clinical trial data TO for an active arm of a clinical trial performed using subjects that satisfy the first set of l&EC L0 is received. At optional step 108, one or more weighting values are received. Steps 142 to 148 are performed in substantially the same manner as described previously with reference to steps 102 to 108 of FIG. 4.

At step 150, a second set of l&EC L1 is received in substantially the same manner as described previously with reference to step 36 of FIG. 2. The second set of l&EC L1 is stored in the volatile memory 8 or the NVM 10 as a current second set of l&EC L1c. At step 152, a current second set of attribute data M1c is received. The current second set of attribute data M1c includes attributes of one or more individuals which satisfy the current second set of l&EC L1c, and is received in the same manner as described previously with reference to step 34 of FIG. 2.

At step 154, a current second set of clinical trial data T1 c for an external control arm of a clinical trial is received which includes clinical trial data for one or more individuals satisfying the current second set of l&EC L1c, and is received in substantially the same manner as described previously with reference to step 116 of FIG. 4.

At step 156, a distance value DVc indicative of a distance between the first set of attribute data MO and the current second set of attribute data M1c is calculated and stored in the volatile memory 8 or the NVM 10, associated with the current second set of l&EC L1c. At step 158, a sample size SSc of the current second set of clinical trial data T1 c is computed and stored in the volatile memory 8 or the NVM 10, associated with the current second set of l&EC L1c. At step 160, the length of a confidence interval of a treatment effect from the first set of clinical trial data TO and the current second set of clinical trial data T1c is estimated, and the resultant confidence-interval length estimate ILc is stored in the volatile memory 8 or the NVM 10, associated with the current second set of l&EC L1c. These steps 156, 158 and 160 are performed in substantially the same manner as described previously with reference to steps 118, 120 and 122 of FIG. 4 respectively.

At step 162, the distance value DVc, sample size SSc and confidence-interval length estimate ILc are linearly combined in order to obtain an overall quality rating Qc for the current second set of l&EC L1c. This includes adding or subtracting each of the values, where values for which a smaller value is more desirable are subtracted, and values for which a larger value is more desirable are added.

In later iterations, step 162 may further comprise performing a normalisation operation between the current values DVc, SSc and ILc, and those from previous iterations before the linear combining operation is performed. Unlike the normalisation described above, where the smallest and largest possible values of DV, SS and IL were known, in this case the range of values to be normalised can change as the iterative optimisation proceeds. Therefore, the smallest and largest values are decided in advance, and the linear normalising transformation is applied based on these predetermined values. For SS, the predetermined smallest value is one and the predetermined largest value is the sample size of the actual registry (i.e. the number of all possible patients). For IL, the predetermined smallest value is zero and the predetermined largest value is three, assuming the Hazard ratio is used to measure the treatment effect, since this is rarely larger than three (if any values exceed three, the normalisation simply sets them equal to three). For DV, the predetermined smallest value is zero and the predetermined largest value is set as follows: take the list of L0 and move all the criteria as far away as possible from L0; this new list Lmax is really far away from L0; now compute the distance between LO and Lmax, and set this as the largest possible value of DV. This calculation should be repeated for every study (every new LO).

Optionally, where weighting values are received at optional step 148, this linear combining of the values at step 162 includes multiplying any values for which an associated weighting value was received by that weighting value before performing the linear combining operation.

At step 164, it is determined whether to stop iterating through different possible second sets of l&EC L1. This determination may be made by determining whether all possible sets of l&EC L1 have been explored, or by determining whether the overall quality rating Qc has been (at least locally) optimised (e.g. maximised). This may, for example, comprise provisionally calculating whether modifying any of the constraints included in the current second set of l&EC L1c would improve (e.g. increase) the overall quality rating Qc. Other embodiments might additionally or alternatively use one or more other stop conditions such as when a predetermined number of possible l&EC sets has been analysed (e.g. M=1000). This may be useful to control how much time it takes to do the whole analysis. A combination of two criteria may be used: the optimised function does not improve anymore (or improves very little, as it close to the maximum) OR there have been M iterations. When it is determined to continue iterating through different possible second sets of l&EC L1, the method 140 proceeds to step 166. When it is determined to stop iterating through different possible second sets of l&EC L1, the method proceeds to step 170.

At step 166, one of the constraints included in the current second set of l&EC L1c is selected and modified by a single discretised unit thereof, e.g. one year of age, a single cancer stage, etc, in order to obtain a modified second set of l&EC Lira. In one embodiment, a random constraint in the current second set of l&EC L1c is selected and modified at step 166.

In other embodiments, step 166 comprises iterating through each constraint in the current second set of l&EC L1c and modifying it by a single discretised unit of that constraint in order to obtain a temporary modified second set of l&EC Lftemp, and then calculating a temporary overall quality rating Qtemp for the temporary modified second set of l&EC Lftemp according to the method described previously with reference to steps 152 to 162. This process is repeated for each of the constraints in the current second set of l&EC L1c (provided a modification thereof is still possible), and the modified second set of l&EC Lira is selected from the temporary modified second sets of l&EC Lftemp in dependence on the temporary quality ratings calculated for each.

In one embodiment, the temporary modified second set of l&EC Lftemp having the best (i.e. smallest) temporary overall quality rating Qtemp is selected as the modified second set of l&EC Lira. In another embodiment, one of the temporary modified second sets of l&EC Lftemp is selected randomly according to a calculated probability distribution, where the probability of a temporary modified second set of l&EC Lftemp being selected is dependent on the temporary overall quality rating Qtemp calculated for it (e.g. with the temporary modified second set of l&EC L7temp having the best, i.e. smallest, temporary overall quality rating Qtemp receiving the highest probability of being selected).

At step 168, the modified second set of l&EC Lir is stored in the volatile memory 8 or the NVM 10 as the current second set of l&EC L1c. The second set of l&EC L1 before the modification at step 166, along with the overall quality rating Q calculated for it, may be stored in the memory 8, 10 for later access. The method 140 then proceeds back to step 152, where another iteration is performed.

At step 170, the iterative method 140 ends. At this stage, depending on the determination made at step 164, the current second set of l&EC L1c stored in the memory 8, 10 may represent a best (i.e. optimal or optimised) second set of l&EC L1 according to its overall quality rating Q, or there may be a number of a second sets of l&EC L1 and associated overall quality ratings Q stored in the memory 8, 10 which may be ranked and displayed in a similar manner to that described previously with reference to step 132 of FIG. 4, or from which one or more optimal second sets of l&EC L1 may be identified in a similar manner to that described previously with reference to step 130 of FIG. 4.

In some embodiments, the system 1 may be used for analysing clinical trials after they have been performed. It can allow a user to contextualise results of a clinical trial by adjusting l&EC and examining how modifications to l&ECs impact the population size and efficacy estimates.

Validation

Various simulations have been conducted that demonstrate the efficacy of some of the novel principles disclosed herein. The following points summarise what was done:

1. A “true” trial was simulated, with an active and a control arm. For the active arm a certain drug was used and the treatment outcome was measured as a continuous quantity (for example a measurement of some biomarker in the blood). In the control arm, standard treatment was used, and the outcome biomarker was measured. A list of l&EC was simulated, based on several features (covariates) of the patients.

2. A registry was simulated, where only the standard treatment had been used. The outcome of this was recorded with all covariates.

3. Several relaxed lists of l&EC were created, and, for each, a new control arm of patients was produced, selected from the simulated registry.

4. The original control arm was compared with the other ones based on relaxed lists of l&EC, using three distance measures, each according to the present disclosure. In this way the efficacy of these distance measures is demonstrated in allowing the various relaxations to be distinguished and compared.

5. Some relevant plots were produced that compare the various relaxations. Such plots could, for example, help potential users decide which relaxations to choose.

The simulations used simulated data, rather than live data, and considered only a limited number of relaxations and a few distance measures, but they nevertheless demonstrate the utility of methods disclosed herein.

The following covariates of each patient in the registry and the ’’true” trial were simulated: x1 = 1 , 2, or 3, each with probability 1/3 x2 = 1 , 2, 3, or 4, each with probability % x3 is a random variable: uniform in (0, 1) x4, x5, x6 and x7 are random variables: standard normal x8 is a random variable: binomial with probability 0.5 x9 is a random variable: Poisson with mean 4 x10 is a random variable: exponential with rate 0.5

Seven of these covariates are assumed to affect treatment selection: (X1 , . . . ,X7).

Seven of these covariates are assumed to affected the treatment effect: (X4, . . . ,X10). Therefore some covariates directly affect both treatment selection and treatment effect.

The probability of receiving new treatment is the logistic regression logit(pi) = ao, treat +a_mx1 +a_mx2 +a_vsx3 +a_wx4 +a_mx5 +a_sx6 +a_vsx7, where a_w = -0.02, a_m = 0.75, a_s = 1 .5, a_vs = -3

Coefficients a_w, a_m, a_s , a_vs are intended to denote weak, moderate, strong and very strong effects, respectively ao, treat was selected so that the proportion of treated is 20%

Treatment status was generated by sampling a Bernoulli variable per patient /,

Z, ~ Be (pt)

The treatment effect was generated by linear regression

Yi = Zj + a_wx4 + a_mx5 + a_sx6 + a_vsx7 + a_wx8 + a_mx9 + a_sx10 + ei, where €j ~ N(0, o = 3)

Figure 6 shows a graph of the model generating the simulated data. The treatment effect was estimated from data via propensity score matching.

All the covariates that affect the treatment effect were used when estimating the propensity score with a logistic regression model x4, . . . , x10.

The propensity score matching method used was nearest-neighbor matching on the logit of the propensity score with specified caliper. A caliper equal to 0.2 of the standard deviation of the logit of the propensity score was used.

Different relaxations of original l&EC were simulated, as follows:

’’Original” l&EC active arm: x1 = 1 , x2 = 1 , x3 > 0.2

’’Original” l&EC: L0 control arm: x1 = 1 , x2 = 1 , x3 > 0.2

Relaxed l&EC on the first two covariates:

L1 control arm: x1 = 1 or 2, x2 = 1 , x3 > 0.2

L2 control arm: x1 = 1 or 2 or 3, x2 = 1 , x3 > 0.2

L3 control arm: x1 = 1 or 2, x2 = 1 or 2, x3 > 0.2

L4 control arm: x1 = 1 or 2, x2 = 1 or 2 or 3, x3 > 0.2

L5 control arm: x1 = 1 or 2 or 3, x2 = 1 or 2, x3 > 0.2

L6 control arm: x1 = 1 or 2 or 3, x2 = 1 or 2 or 3, x3 > 0.2

Relaxed l&EC, on the third covariate in addition:

L7 control arm: x1 = 1 or 2, x2 = 1 , x3 > 0.1

L8 control arm: x1 = 1 or 2 or 3, x2 = 1 , x3 > 0.1

L9 control arm: x1 = 1 or 2, x2 = 1 or 2, x3 > 0.1

L10 control arm: x1 = 1 or 2, x2 = 1 or 2 or 3, x3 > 0.1

L11 control arm: x1 = 1 or 2 or 3, x2 = 1 or 2, x3 > 0.1

L12 control arm: x1 = 1 or 2 or 3, x2 = 1 or 2 or 3, x3 > 0.1

Relaxed l&EC, on the third covariate even further:

L13 control arm: x1 = 1 or 2, x2 = 1 , x3 > 0

L14 control arm: x1 = 1 or 2 or 3, x2 = 1 , x3 > 0

L15 control arm: x1 = 1 or 2, x2 = 1 or 2, x3 > 0

L16 control arm: x1 = 1 or 2, x2 = 1 or 2 or 3, x3 > 0

L17 control arm: x1 = 1 or 2 or 3, x2 = 1 or 2, x3 > 0

L18 control arm: x1 = 1 or 2 or 3, x2 = 1 or 2 or 3, x3 > 0

Measures were used to compare the various relaxed control arms as follows:

1 . Control size: Sample size of the control arm

2. Confidence interval length: Length of 0.95% confidence interval (Cl) of the treatment effect of the relaxed trial (with the true active and the relaxed control arm).

3. Distance of relaxation, as in Figure 3: Let M0 be the set of individuals in the control arm of the ’’original” study with l&EC L0. Let M2 be the set of individuals in the original active arm. Let M1 be the set of individuals with relaxed l&EC L1 . Propensity score matching of MO with M2 was performed. Call SO the set of such matched controls in the original l&EC. Next propensity score matching of M1 with M2 was performed. Call S1 the set of matched controls in this relaxed new l&EC. Finally the distance is 1 minus the cardinality of the intersection of SO and S1 divided by the number of individuals in the original active arm M2. This distance is between 0 and 1 .

Figure 7 shows, in the first line, the control arms ordered according to increasing sample size. The smallest one is the original control arm, all other are relaxed control arms including more patients.

In the second line of Figure 7, the control arms are ordered according to increasing length of the confidence interval of the treatment effect. The largest one is the original control arm. Smaller lengths are to be preferred.

In the third line of Figure 7, the control arms are ordered according to increasing relaxation distance from the original list L0. The list closet to L0 is L1 , followed by L7.

The three panels in Figure 8 are plots of respective pairs of measures. For example, the second panel shows how the length of the confidence interval (Cl) depends on the distance of relaxation. From this, it might be determined that L15 would be a good choice to make, as the distance of relaxation is not among the largest and the length of Cl is smallest.

Figure 9 represents the data in a way that facilitates comparison of the relaxations according to the three measures together.

Any of the plots shown in Figures 7, 8 or 9, or similar plots generated with real or simulated data, may be rendered for display by a processing system, e.g. to enable a user to compare different sets of l&EC and make an optimal selection.

These simulated examples demonstrate the utility of generating distance values according to methods disclosed herein.

It will be appreciated by those skilled in the art that the invention has been illustrated by describing one or more specific embodiments thereof, but is not limited to these embodiments; many variations and modifications are possible, within the scope of the accompanying claims.

Claims

1 . A computer-implemented method of processing data encoding inclusion and exclusion criteria (l&EC) for selection of subjects for a research study, comprising: receiving first criteria data encoding a first set of l&EC for selection of subjects for a research study, the first set of l&EC comprising a first plurality of constraints each relating to a respective attribute category of a plurality of attribute categories for subjects for inclusion in, or exclusion from, the research study; receiving a first set of attribute data comprising, for each of one or more individuals that comply with the first set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories; receiving second criteria data encoding a second set of l&EC for selection of subjects for a research study, the second set of l&EC comprising a second plurality of constraints each relating to a respective attribute category of the plurality of attribute categories; receiving a second set of attribute data comprising, for each of one or more individuals that comply with the second set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories; processing the attributes of the first and second sets of attribute data to determine a distance between the first and second sets of attribute data; and generating a distance value indicative of the determined distance and storing the distance value in a memory for further processing.

2. The computer-implemented method of claim 1 , wherein the first and second sets of attribute data each encode a respective array or list storing attribute categories along a first dimension and individuals along a second dimension.

3. The computer-implemented method of claim 1 or 2, wherein processing the attributes of the first and second sets of attribute data to determine a distance between the first and second sets of attribute data comprises: performing a normalisation operation on each of the attributes of first set of attribute data in order to obtain a first set of normalised attribute data; performing a normalisation operation on each of the attributes of second set of attribute data in order to obtain a second set of normalised attribute data; and processing the normalised attributes of the first and second sets of normalised attribute data in order to determine the distance between the first and second sets of attribute data.

4. The computer-implemented method of any preceding claim, comprising: performing a dimensionality reduction operation on the first set of normalised attribute data in order to obtain a first dimensionally reduced set of normalised attribute data; and performing a dimensionality reduction operation on the second set of normalised attribute data in order to obtain a second dimensionally reduced set of normalised attribute data.

5. The computer-implemented method of claim 4, wherein performing the dimensionality reduction operation comprises performing a principle component analysis.

6. The computer-implemented method of claim 5, comprising: computing a first number, k1, of principle components of the first dimensionally reduced set of normalised attribute data having a variability metric that is greater than or equal to a predetermined proportion of a variability metric calculated for the first dimensionally reduced set of normalised attribute data; computing a second number, k2, of principle components of the second dimensionally reduced set of normalised attribute data having a variability metric that is greater than or equal to a predetermined proportion of a variance metric calculated for the second dimensionally reduced set of normalised attribute data; determining a maximum value, kmax, of k1 and /c2; selecting a first subset of the first dimensionally reduced set of normalised attribute data, the first subset comprising kmax principle components of the first dimensionally reduced set of normalised attribute data; and selecting a second subset of kmax principle components of the second dimensionally reduced set of normalised attribute data, the second subset comprising kmax principle components of the second dimensionally reduced set of normalised attribute data.

7. The computer-implemented method of claim 6, comprising: determining a first centroid of the first subset; determining a second centroid of the second subset; and computing a Euclidean distance between the first and second centroids, wherein the Euclidean distance measures the distance between the first and second sets of attribute data.

8. The computer-implemented method of any of claims 1 to 3, comprising: receiving a third set of attribute data comprising, for each of one or more research study subjects included in an active arm of a research study, a respective attribute for each attribute category of the plurality of attribute categories; performing a propensity score matching operation between the third set of attribute data and the first set of attribute data in order to obtain a first subset of the first set of attribute data; and performing a propensity score matching operation between the third set of attribute data and the second set of attribute data in order to obtain a second subset of the second set of attribute data.

9. The computer-implemented method of claim 8, comprising: computing an intersection between the first subset and the second subset; computing a cardinality of the intersection in order to obtain a cardinality value indicative of the cardinality; and determining the distance between the first and second sets of attribute data from the cardinality value.

10. The computer-implemented method of claim 9, wherein determining the distance between the first and second sets of attribute data comprises: dividing the cardinality value by a divisor derived from a number of individuals included in the third set of attribute data; and subtracting the result from a predetermined number.

11 . The computer-implemented method of any preceding claim, wherein the first and second sets of attribute data are received from one or more databases of patient data.

12. The computer-implemented method of any preceding claim, comprising: receiving a first set of research study data for an active arm of a first research study performed using research study subjects satisfying the first set of l&EC; accessing one or more databases of patient data in order to obtain a second set of research study data, for a control arm of a research study, comprising individuals satisfying the second set of l&EC; computing a sample size of the second set of research study data, and storing a sample size value indicative of the sample size in the memory; computing an estimate of a length of a confidence interval of a treatment effect of the first research study based on the first set of research study data and the second set of research study data, and storing a confidence-interval length estimate value indicative of the estimated confidence interval length in the memory; and computing an overall quality rating for the second set of l&EC in dependence upon the stored distance value, the stored sample size value and the stored confidence-interval length estimate value, and storing the overall quality rating in the memory.

13. The computer-implemented method of claim 12, wherein computing the overall quality rating for the second set of l&EC comprises linearly combining the stored distance value, the stored sample size value and the stored confidence-interval length estimate value.

14. The computer-implemented method of claim 12 or 13, comprising: receiving one or more weighting values, each weighting value being associated with the distance value, the sample size value, or the confidence-interval length estimate value; and wherein computing the overall quality rating for the second set of l&EC comprises: weighting the stored distance value, the stored sample size value and the stored confidence-interval length estimate value according to the weighting value(s) associated therewith; and combining the weighted distance value, the weighted sample size value and the weighted confidence-interval length estimate value.

15. The computer-implemented method of any of claims 12 to 14, comprising: storing the second set of l&EC in the memory as a current second set of l&EC; and in each of one or more iterations: modifying a constraint included in the current second set of l&EC in order to obtain a modified second set of l&EC, and storing the modified second set of l&EC as the current second set of l&EC in the memory; receiving a revised second set of attribute data comprising, for each of one or more individuals that comply with the current second set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories; accessing the one or more databases of patient data in order to obtain a revised second set of research study data, for a control arm of a research study, comprising individuals satisfying the current second set of l&EC; processing the attributes of the first set of attribute data and the revised second set of attribute data to determine a distance between the first set of attribute data and the revised second set of attribute data, and storing a current distance value indicative of the determined distance in the memory; computing a sample size of the revised second set of research study data, and storing a current sample size value indicative of the sample size in the memory; computing an estimate of a length of a confidence interval of a treatment effect of the first research study based on the first set of research study data and the revised second set of research study data, and storing a new confidence-interval length estimate value indicative of the estimated confidence interval length in the memory; and computing a current overall quality rating for the current second set of l&EC based on the stored current distance value, the stored current sample size value and the stored current confidence-interval length estimate value, and storing the current overall quality rating in the memory.

16. The computer-implemented method of claim 15, wherein modifying the constraint in each iteration comprises: for each constraint included in the current second set of l&EC: modifying the constraint by a single discretised unit thereof in order to obtain a temporary modified second set of l&EC; and computing and storing in the memory a temporary overall quality rating for the temporary modified second set of l&EC; and selecting one of the temporary modified second sets of l&EC to use as the modified second set of l&EC in dependence on the temporary stored overall quality ratings.

17. The computer-implemented method of claim 16, wherein selecting one of the temporary modified second sets of l&EC comprises selecting the temporary modified second set of l&EC having the best overall quality rating.

18. The computer-implemented method of claim 16, wherein selecting one of the temporary modified second sets of l&EC comprises selecting one of the temporary modified second sets of l&EC according to a calculated probability distribution, wherein the probability of a temporary modified second set of l&EC being selected is dependent on the overall quality value associated therewith.

19. The computer-implemented method of any of claims 1 to 11 , comprising: receiving a first set of research study data for an active arm of a first research study performed using research study subjects satisfying the first set of l&EC; receiving data encoding one or more further second sets of l&EC for respective research studies so as to receive data encoding a plurality of second sets of l&EC, each of the plurality of second sets of l&EC comprising a respective second plurality of constraints each relating to a respective attribute category of the plurality of attribute categories; for each of the plurality of second sets of l&EC: receiving a respective second set of attribute data comprising, for each of one or more individuals that comply with the second set of l&EC, a respective attribute for each attribute category of the plurality of attribute categories; accessing one or more databases of patient data to obtain a respective second set of research study data, for a control arm of a research study, comprising individuals which satisfying the second set of l&EC; and processing the attributes of the first set of attribute data and the second set of attribute data to determine a distance between the first and second sets of attribute data, and storing a respective distance value indicative of the determined distance in the memory; computing a sample size of the second set of research study data, and storing a respective sample size value indicative of the sample size in the memory; and computing an estimate of a length of a confidence interval of a treatment effect of the first research study based on the first set of research study data and the second set of research study data, and storing a respective confidence-interval length estimate value indicative of the estimated confidence interval length in the memory; and computing an overall quality rating for each second set of l&EC based on the respective stored distance value, the respective stored sample size value and the respective stored confidence-interval length estimate value associated therewith, and storing the overall quality ratings in the memory.

20. The computer-implemented method of claim 19, comprising: performing a normalisation operation on each stored distance value in order to obtain and store in the memory a respective normalised distance value; performing a normalisation operation on each stored sample size value in order to obtain and store in the memory a respective normalised sample size value; performing a normalisation operation on each stored confidence-interval length estimate value in order to obtain and store in the memory a respective normalised confidence-interval length estimate value; and computing the respective overall quality rating for each second set of l&EC based on the respective normalised distance value, the respective normalised sample size value and the respective normalised confidence-interval length estimate value associated therewith.

21 . The computer-implemented method of claim 20, wherein computing the respective overall quality rating for each second set of l&EC comprises linearly combining the respective normalised distance value, the respective normalised sample size value and the respective normalised confidence-interval length estimate value.

22. The computer-implemented method of claim 20 or 21 , comprising: receiving one or more weighting values, each weighting value being associated with the distance value, the sample size value, or the confidence-interval length estimate value; and wherein computing the overall quality rating for each second set of l&EC comprises: weighting the stored normalised distance value, the stored normalised sample size value and the stored normalised confidence-interval length estimate value according to the weighting value(s) associated therewith; and combining the weighted normalised distance value, the weighted normalised sample size value and the weighted normalised confidence-interval length estimate value.

23. The computer-implemented method of any of claims 19 to 22, comprising identifying an optimal set of one or more of the plurality of second sets of l&EC in dependence on the stored overall quality ratings.

24. The computer-implemented method of any of claims 19 to 23, comprising ranking the plurality of second sets of l&EC according to their respective stored overall quality ratings.

25. The computer-implemented method of claim 24, comprising displaying, on a display, a list of one or more of the plurality of second sets of l&EC ranked according to their respective stored overall quality ratings.

26. The computer-implemented method of any preceding claim, wherein the research study is a clinical trial.

27. A computer system configured to process data, encoding inclusion and exclusion criteria (l&EC) for selection of subjects for a research study, according to the method of any preceding claim.

28. Computer software comprising instructions which, when executed on a computer system, cause the computer system to process data, encoding inclusion and exclusion criteria (l&EC) for selection of subjects for a research study, according to the method of any of claims 1 to 26.