WO2014182725A1

WO2014182725A1 - Matching data from variant databases

Info

Publication number: WO2014182725A1
Application number: PCT/US2014/037006
Authority: WO
Inventors: Ricky Nguyen; Sheryl JOHN; Francisco CAI; Yael PELED; Randall C. WETZEL
Original assignee: Childrens Hospital Los Angeles
Current assignee: Childrens Hospital Los Angeles
Priority date: 2013-05-07
Filing date: 2014-05-06
Publication date: 2014-11-13
Anticipated expiration: 2015-11-07

Abstract

A system for collaborative, data-driven translation of nomenclatures for data integration is described. Translation of observational data leverages both language and mathematics. Innovative scoring algorithms and interactive user interfaces aid in the discovery of matches. A process of sharing and approval enables collaboration despite multi-institutional barriers. While this system was developed with the intention of integrating clinical electronic health record data across multiple hospitals, it can be applied to a wide variety of data from any domain, due to its guiding principles and flexible design.

Description

MATCHING DATA FROM VARIANT DATABASES

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] This application claims the benefit of, and priority to, U.S. Provisional Patent Application No. 61/820,671, filed on May 7, 2013, entitled "SIMILAR TERM ANALYZER," the entire disclosure of which is hereby incorporated herein by reference. Also, this application claims the benefit of, and priority to, U.S. Provisional Patent Application No. 61/863,939, filed on August 9, 2013, entitled "SIMILAR TERM ANALYZER," the entire disclosure of which is hereby incorporated herein by reference.

TECHNICAL FIELD

[0002] The present invention relates to the fields of computer software and record keeping, for example medical record keeping. More specifically, the invention relates to computer software and computer systems for managing data, such as medical data, that is in multiple different formats.

SUMMARY OF THE INVENTION

[0003] Every hospital has an idiosyncratic nomenclature - even if using the same electronic health records (EHR) (also known as electronic medical records (EMR)) vendor, they create their own nomenclatures. Matching data with these disparate idiosyncratic nomenclatures poses a major barrier to combining medical datasets. Constructing these translations by simple comparison of names is not only extremely time-consuming and labor-intensive, but can also be wildly inaccurate. We propose an alternative means other than mere label matching.

[0004] Data has structure: distribution, magnitude, units, and values. Even without linguistic tags, data about the same real-world entity, process, or observation should have very similar structure. Looking for similar structural data in a large medical database provides another way of matching terms that can be automated and reduced to a workflow that greatly facilitates combining datasets with disparate nomenclatures.

[0005] Yet another barrier to data integration is the distributed, isolated nature of the data, separated by geographic distance and the security, privacy, and ownership policies tied to the data. We propose software to assist and expedite the translation process required for data integration, using a data-driven approach that accommodates the distributed nature of collaboration.

[0006] Because hospitals save data under different terms, a major challenge when working with different hospitals is to have a unified dataset. For example, the term "heart rate" can be expressed in several ways, such as: pulse, hr, and h. rate. Additional examples include the various ways to express blood pressure (Systolic Blood Pressure, SBP, BP, Arterial SBP), base excess (Base Excess, BE, BEa, ABG BE), the various methods for taking temperature (axillary, oral, rectal), and the various ways to express amounts (e.g., mEq/L, mmol/L, ml, cc). Also, data is collected and expressed in many different languages. The present invention is a data mapping tool that maps similar terms from any dataset to a base dataset. The data-mapping tool is based on the distribution, scale, range, and value of individual clinical parameter observations (such as heart rate, pressures, lab values). Clinical data has a characteristic distribution; it has a human- readable display name and an associated unit of measure. The software and system of the invention use machine-learning techniques to compare these features (distribution, display name, and units). Algorithms, such as the Kullback-Leibler divergence and the Earth Mover's Distance (EMD), are used to measure distribution dissimilarity; the UMLS (Unified Medical Language System) Metathesaurus for semantic comparison; and string clustering methods in combination with knowledge bases, such as UCUM (Unified Code for Units of Measure), are used to compare, convert and combine the parameters of individual datasets that often have disparate semantic structure. This entire process can be expressed in workflows and can be applied to other data mapping challenges, thus assisting in mapping data sets to a common data structure for analysis.

[0007] A critical requirement of data integration is the translation of individual, disparate nomenclatures into a single common nomenclature. Constructing these translations by simple comparison of names is not only extremely time-consuming and labor-intensive, but can also be wildly inaccurate. In addition, coordination between collaborating institutions is made more difficult by geographic distance and the security, privacy, and ownership policies regarding the data to be integrated. The present invention provides software to assist and expedite the translation process required for data integration, using a data-driven approach that accommodates the distributed nature of collaboration.

[0008] Adoption of EHR and the digitization of patient data have opened up a world of computational possibilities. The invention enables the extraction of information from complex clinical databases into a single, flexible system. To date, the diverse proprietary EHR formats unique to each institution have locked important usable data in non-interoperable information silos. Traditional data integration methods have proven to be inflexible when dealing with the ad-hoc nature of clinical research and also proven to be cumbersome when dealing with the idiosyncratic content and structure of multi-institutional EHR data. The present invention provides a suite of software, running on appropriate computer hardware, that has the flexibility and agility to overcome these obstacles, using the data itself to facilitate the building of integrated clinical datasets that can be shared among a network of hospital intensive care units. Dynamic terminologies and data-driven mapping are used to integrate and disseminate clinical research datasets.

[0009] Standardized medical ontologies and terminologies, such as SNOMED-CT, LOINC, and the UMLS Metathesaurus, have come a long way in improving interoperability of medical data. However, for domain-specific research, these centralized, curated ontologies have shortcomings. Among those shortcomings, these formal ontologies often do not extend into certain medical domains with the specificity required. And, submitted change requests must go through lengthy review and approval processes. In contrast, the present invention allows dynamic, evolving terminologies to be defined by the researcher without eliminating the possibility of later adopting a standardized ontology. In essence, the invention decouples the evolution of a researcher's terminology from the glacial pace of curated ontologies, allowing both flexibility for domain specificity and the potential to take advantage of interoperable standards.

[0010] To integrate datasets, all data elements must be translated, or mapped, from their native nomenclatures to a common nomenclature. This translation process, when solved by data entry or expert identification of equivalent labels, is a time-consuming and laborious task. Software solutions that automate this job by comparing medical terms and abbreviations do not always produce reliable mappings. Such software approaches often fail because they completely ignore the most useful pieces of information: the observation data associated with each element.

[0011] The present invention provides mappings with an interactive tool providing instant visualization and feedback. As a non-limiting example, FIGS. 1A and IB show how the invention can be used to determine what type of data is represented by ambiguous data labels. In this example, two observations are provided relating to "Arterial BP". The data for each of these observations are compared to data known to relate to systolic blood pressure. In FIG. 1A, the "Arterial BP" data (dark histogram, concentrated more to the left side of FIG. 1 A) maps well to data known to relate to systolic blood pressure (light histogram, concentrated more to the right side of FIG. 1A), while in FIG. IB, the "Arterial BP" data (dark histogram, concentrated more to the left side of FIG. IB) does not map well to the data known to relate to systolic blood pressure (light histogram, concentrated more to the right side of FIG. IB). As such, the ambiguous label "Arterial BP" can be resolved into "Arterial Systolic BP" in FIG. 1A and "Arterial Diastolic BP" in FIG. IB. In other words, when measuring what is called "systolic blood pressure", "BP systolic", "SBP", or any other term representing that measurement, the observed data will exhibit the same distribution of values as other, known values for that measurement because they measure the same physical entity. Statistics and information theory provide several methods to measure distribution similarity. The present invention exploits the similarity of distributions, as well as probabilistic and semantic string matching techniques, to identify and recommend matching elements in heterogeneous, idiosyncratic databases.

[0012] In FIG. 1A, the x-axis includes labels from 50 to 150 in 10 unit increments, and the y-axis includes labels from 0.000 to 0.020 in 0.002 unit increments. In FIG. IB, the x- axis includes labels from 30 to 150 in 10 unit increments, and the y-axis includes labels from 0.000 to 0.030 in 0.005 unit increments.

[0013] Once matching elements have been identified, integration can begin. The observational data can be enriched with the mappings of not just one, but any of numerous terminologies. For example, a "Pulse Oximetry" observation may map to "Sp02" in one terminology and to "Oxygen Saturation" in another. The invention can generate multiple mappings from many sources to a single terminology for a particular application. Additionally, the invention can also generate mappings for multiple applications. The invention allows one to easily map data from any source to any terminology.

[0014] The usefulness of an integrated dataset extends far beyond the medical field. These integrated datasets can enable data-sharing networks, where datasets are described by searchable metadata. Browsers of the network can easily discover datasets applicable to their own research or application needs. As non-limiting examples, the invention is applicable to data research communities in the domains of climate science, planetary science, earth science, radio astronomy, and cancer biomarkers. The software and system of the invention provide a flexible storage method, evolving terminologies for various applications, data-driven mapping of elements, and the building and sharing of generic integrated datasets. The complexity of medical data and the flexibility required by clinical research has prompted the solution, but the solution is not limited to that field. Because the terminology is decoupled, and because the data itself drives the mapping process, this system is transferable to data from any domain. [0015] In various embodiments, the present invention can be employed by intensive care units, can provide an automated data entry solution to save time and money, can map data from an individual site to a network's terminology, can pre-populate fields for review and submission to a data repository, and can include an integrated data repository that can run dynamic reports for quality improvement.

[0016] In various embodiments, the present invention can comprise a terminology- matching recommendation system, which is an example of a machine learning classifier. The performance of the present invention can be evaluated using measures such as precision, recall and an area under a receive operating characteristic curve. User-generated mappings (as a labeled training set) can be compared with recommendations calculated by various testable algorithms.

[0017] In various embodiments, the present invention can enable data integration without access to personal health information (PHI), can reduce time-consuming steps for identifying data, can provide confidence in matching, is more accurate than known methods and systems and can provide more complete mapping relative to known methods and systems.

[0018] In various embodiments, the present invention can use raw input data to better reveal data of interest, can use features of the data to discover similar data, can provide visualizations of the data to ease mapping and can incorporate self-improving routines through aggregation of data.

[0019] In various embodiments, the present invention can include functions of search (to reduce time), validation (to increase confidence), a review for accuracy and a review for completeness. The search function can provide facilities, including but not limited to, browsing, filtering, sorting, navigating based on data attributes to help the user find similar data. The validation function can be used to visually confirm the similarity of data attributes, like for example histograms and statistics, to determine a likely match. The accuracy function can include quantitative scoring and qualitative graphs to ensure the best matches are discovered. The completeness function can ensure that, if more than one match exists, the user can discover all appropriate matches.

[0020] In various embodiments, the present invention can be used to significantly increase research capabilities, to facilitate multi-site research, to provide an application layer that is EMR-independent, that can be monetized and that can be mobile. [0021] In various embodiments, the present invention can utilize any suitable user interface, can include a combined score function, can utilize one or more algorithms and can provide hospital categorization.

[0022] In one aspect, provided herein is a computer-implemented system for identifying matching data of a real-world, measurable concept, where the data have inconsistent associated descriptive labels, said system comprising: computer software executed on appropriate computer hardware, wherein the software executes the following method steps: establish elements of a dataset; compare observations of a real-world occurrence of a measurable concept to appropriate elements of the dataset; output data representing the compared observation and elements of the dataset; wherein the output indicates whether the compared observation and elements of the dataset represent the same real-world, measurable concept.

[0023] In one embodiment of this aspect, the step of establishing elements of a dataset is performed based on a single set of data.

[0024] In another embodiment of this aspect, the step of establishing elements of a dataset is performed based on multiple inputs of data from multiple sources.

[0025] In another embodiment of this aspect, the output data is a graphical representation indicating whether the observations were consistent with the selected elements of the dataset.

[0026] In another aspect, provided herein is a computer-implemented system for matching data of a real-world measurable concept, where the data have inconsistent associated identifying or descriptive labels, said system comprising: a) a Term Generation Framework, in which observational data, stored in any of a wide variety of formats, is used to compute the aggregate facts (Terms) about observations of a single entity, wherein aggregate facts include but are not limited to histograms, summary statistics, descriptive labels, and units of measure; a single Term derives from a single data source, but different Terms may come from multiple data sources, differentiated by geography, physical location, software measures, hardware, data formats, policies, and security; b) a module for creation of dataset and elements, which establishes a single purpose for a Dataset and establishes the required real-world data Elements that serve that purpose; c) an Output module for mapping/correlation, indicating equivalence between Terms and Elements of a Dataset; d) a Score Generation Framework, in which the construction of the output is aided by the computation of the relative similarity of aggregate facts about observations of a single entity; similarity is expressed according to, but not limited to, the following features: distribution, descriptive label, and units of measure; distribution similarity is computed using statistical or information theoretic measures, including Kullback-Leibler Divergence estimates and Earth Mover's Distance; label similarity is computed using string distance metrics, including variations on Jaro-Winkler, and Level2Jaro-Winkler; label similarity incorporates semantic meaning from established semantic databases, such as the UMLS Metathesaurus and its constituent ontologies and terminologies; units similarity is determined from standard units in practice, equivalent units, common abbreviations, and established unit standards such as UCUM; and composite similarity across multiple features is computed as well, including but not limited to a linear combination (weighted sum) of individual features or the naive bayes classifier; e) a Data API, which is an application programming interface that provides storage, retrieval, and manipulation of data necessary for the collaborative term mapping process, including but not limited to Terms, Scores, Elements, Datasets, Matches, and AggregateTerms; and f) a Web Application, which is a computer software user interface that i) facilitates the assignment of Terms to Elements of a Dataset and utilizes the Data API to assign a Term to an Element, and provides a way to browse, search, sort, and filter candidate Terms based on criteria including not limited to descriptive label, units of measure, similarity scores, and summary statistics, ii) evaluates the relevancy of a Term to an Element, wherein supporting data is displayed so it can be visualized by a user, wherein the supporting data includes but is not limited to visual histogram charts, similarity scores, descriptive labels, summary statistics, units of measure, tabular numeric data, and time series data, iii) suggests Terms to be assigned to Elements by creating a Match, wherein matches are created by one party may be shared with another party for review in the same fashion as evaluation of relevancy with supporting data, wherein matches may be approved and subsequently merged, and wherein the result of merging Matches is a new version of an Element containing data from the merged Term and a new version of a Dataset containing the new Elements.

[0027] In another aspect, provided herein is a computer implemented method for analyzing databases, comprising: on a device having one or more processors and a memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for: 1. obtaining a first aggregate attribute from first data stored in a first database, the first database comprising a first element; 2. obtaining a second aggregate attribute from second data stored in a second database, the second database comprising a second element; 3. comparing the first aggregate attribute of the first element with the second aggregate attribute of the second element; 4. obtaining a set of features as a result of the comparing step, the set of features comprising a quantitative measure of a similarity of the first aggregate attribute of the first element and the second aggregate attribute of the second element; 5. determining whether the first element and second element are equivalent to each other; and 6. outputting a map that relates the first element from the first database with the second element from the second database.

[0028] In one embodiment of this aspect, the first aggregate attribute comprises one or more from the group consisting of a first statistical measurement derived from the first aggregate attribute, a first label derived from the first aggregate attribute and a first unit of measurement associated and the first aggregate attribute, and the second aggregate attribute comprises one or more from the group consisting of a second statistical measurement derived from the second aggregate attribute, a second label derived from the second aggregate attribute and a second unit of measurement associated and the second aggregate attribute.

[0029] In another embodiment of this aspect, the first statistical measurement is a first histogram comprising an x-axis of observed values for the first aggregate attribute, wherein the second statistical measurement is a second histogram comprising an x-axis of observed values for the second aggregate attribute, and the step of comparing comprises a statistical comparison of the first and second histograms.

[0030] In another embodiment of this aspect, the statistical comparison of the first and second histograms comprises one from the group consisting of a probability-distance measure, Kullback-Leibler divergence, Earth Mover's Distance, Kolmogorov-Smirnoff similarity and Anderson-Darling similarity.

[0031] In another embodiment of this aspect, the first label and the second label are compared in the step of comparing by using one from the group consisting of a string-edit technique, a token-based distance technique, a semantic comparison technique, a lexical similarity technique, a language system, SNOMED-CT, LOINC, UMLS, a thesaurus, synonyms of the first and second labels, a string-matching technique, a Level2Jaro-Winkler hybrid distance technique and prevalence similarity.

[0032] In another embodiment of this aspect, the first unit of measurement and the second unit of measurement are compared in the step of comparing by using one from the group consisting of a widely accepted standard unit, a list of different spellings relating to the standard unit, a list of commonly used alternative units, a list of other units measuring the same quantity and a substitution of the standard unit for a missing unit.

[0033] In another embodiment of this aspect, the one or more programs further include instructions for: displaying a graphical user interface comprising one or more fields for displaying information corresponding with one or more of the first database, the first element, the second database, the second element, the first aggregate attribute, the second aggregate attribute and the comparing of the first aggregate attribute and the second aggregate attribute.

[0034] In another embodiment of this aspect, the graphical user interface comprises one or more fields for displaying information corresponding with one or more of the first statistical measurement, the first label, the first unit of measurement, the second statistical measurement, the second label and the second unit of measurement.

[0035] In another embodiment of this aspect, the graphical user interface comprises one or more fields for displaying information corresponding with one or more of the first histogram, the second histogram and the statistical comparison.

[0036] In another embodiment of this aspect, the step of outputting the map comprises displaying a list of first elements or a list of second elements sorted by a statistical score while concurrently displaying one or more histograms based on the first and second aggregate attributes.

[0037] In another embodiment of this aspect, the graphical user interface is adapted to allow a user to select from a plurality of first databases, a plurality of first elements, a plurality of second databases, a plurality of second elements and a plurality of systems for comparing the first aggregate attribute and the second aggregate attribute.

[0038] In another embodiment of this aspect, the graphical user interface further comprises a scatterplot based on the first aggregate attribute and the second aggregate attribute.

[0039] In another embodiment of this aspect, the one or more programs further include instructions for calculating a match prediction score.

[0040] In another embodiment of this aspect, the one or more programs further include instructions for repeating steps 1-5 for every combination of elements from each database.

[0041] In another aspect, provided herein is a computer system for analyzing databases, comprising: one or more processors; and memory to store: one or more programs, the one or more programs comprising instructions for: 1. obtaining a first aggregate attribute from first data stored in a first database, the first database comprising a first element; 2. obtaining a second aggregate attribute from second data stored in a second database, the second database comprising a second element; 3. comparing the first aggregate attribute of the first element with the second aggregate attribute of the second element; 4. obtaining a set of features as a result of the comparing step, the set of features comprising a quantitative measure of a similarity of the first aggregate attribute of the first element and the second aggregate attribute of the second element; 5. determining whether the first element and second element are equivalent to each other; and 6. outputting a map that relates the first element from the first database with the second element from the second database.

[0042] In one embodiment of this aspect, the first aggregate attribute comprises one or more from the group consisting of a first statistical measurement derived from the first aggregate attribute, a first label derived from the first aggregate attribute and a first unit of measurement associated and the first aggregate attribute, and the second aggregate attribute comprises one or more from the group consisting of a second statistical measurement derived from the second aggregate attribute, a second label derived from the second aggregate attribute and a second unit of measurement associated and the second aggregate attribute.

[0043] In another embodiment of this aspect, the first statistical measurement is a first histogram comprising an x-axis of observed values for the first aggregate attribute, wherein the second statistical measurement is a second histogram comprising an x-axis of observed values for the second aggregate attribute, and the step of comparing comprises a statistical comparison of the first and second histograms.

[0044] In another embodiment of this aspect, the statistical comparison of the first and second histograms comprises one from the group consisting of a probability-distance measure, Kullback-Leibler divergence, Earth Mover's Distance, Kolmogorov-Smirnoff similarity and Anderson-Darling similarity.

[0045] In another embodiment of this aspect, the first label and the second label are compared in the step of comparing by using one from the group consisting of a string-edit technique, a token-based distance technique, a semantic comparison technique, a lexical similarity technique, a language system, SNOMED-CT, LOINC, UMLS, a thesaurus, synonyms of the first and second labels, a string-matching technique, a Level2Jaro-Winkler hybrid distance technique and prevalence similarity.

[0046] In another embodiment of this aspect, the first unit of measurement and the second unit of measurement are compared in the step of comparing by using one from the group consisting of a widely accepted standard unit, a list of different spellings relating to the standard unit, a list of commonly used alternative units, a list of other units measuring the same quantity and a substitution of the standard unit for a missing unit.

[0047] In another embodiment of this aspect, the one or more programs further include instructions for: displaying a graphical user interface comprising one or more fields for displaying information corresponding with one or more of the first database, the first element, the second database, the second element, the first aggregate attribute, the second aggregate attribute and the comparing of the first aggregate attribute and the second aggregate attribute.

[0048] In another embodiment of this aspect, the graphical user interface comprises one or more fields for displaying information corresponding with one or more of the first statistical measurement, the first label, the first unit of measurement, the second statistical measurement, the second label and the second unit of measurement.

[0049] In another embodiment of this aspect, the graphical user interface comprises one or more fields for displaying information corresponding with one or more of the first histogram, the second histogram and the statistical comparison.

[0050] In another embodiment of this aspect, the step of outputting the map comprises displaying a list of first elements or a list of second elements sorted by a statistical score while concurrently displaying one or more histograms based on the first and second aggregate attributes.

[0051] In another embodiment of this aspect, the graphical user interface is adapted to allow a user to select from a plurality of first databases, a plurality of first elements, a plurality of second databases, a plurality of second elements and a plurality of systems for comparing the first aggregate attribute and the second aggregate attribute.

[0052] In another embodiment of this aspect, the graphical user interface further comprises a scatterplot based on the first aggregate attribute and the second aggregate attribute.

[0053] In another embodiment of this aspect, the one or more programs further include instructions for calculating a match prediction score.

[0054] In another embodiment of this aspect, the one or more programs further include instructions for repeating steps 1-5 for every combination of elements from each database.

[0055] In another aspect, provided herein is a non-transitory computer-readable storage medium storing one or more programs for analyzing databases, the one or more programs for execution by one or more processors of a computer system, the one or more programs comprising instructions for: 1. obtaining a first aggregate attribute from first data stored in a first database, the first database comprising a first element; 2. obtaining a second aggregate attribute from second data stored in a second database, the second database comprising a second element; 3. comparing the first aggregate attribute of the first element with the second aggregate attribute of the second element; 4. obtaining a set of features as a result of the comparing step, the set of features comprising a quantitative measure of a similarity of the first aggregate attribute of the first element and the second aggregate attribute of the second element; 5. determining whether the first element and second element are equivalent to each other; and 6. outputting a map that relates the first element from the first database with the second element from the second database.

[0056] In one embodiment of this aspect, the first aggregate attribute comprises one or more from the group consisting of a first statistical measurement derived from the first aggregate attribute, a first label derived from the first aggregate attribute and a first unit of measurement associated and the first aggregate attribute, and the second aggregate attribute comprises one or more from the group consisting of a second statistical measurement derived from the second aggregate attribute, a second label derived from the second aggregate attribute and a second unit of measurement associated and the second aggregate attribute.

[0057] In another embodiment of this aspect, the first statistical measurement is a first histogram comprising an x-axis of observed values for the first aggregate attribute, wherein the second statistical measurement is a second histogram comprising an x-axis of observed values for the second aggregate attribute, and the step of comparing comprises a statistical comparison of the first and second histograms.

[0058] In another embodiment of this aspect, the statistical comparison of the first and second histograms comprises one from the group consisting of a probability-distance measure, Kullback-Leibler divergence, Earth Mover's Distance, Kolmogorov-Smirnoff similarity and Anderson-Darling similarity.

[0059] In another embodiment of this aspect, the first label and the second label are compared in the step of comparing by using one from the group consisting of a string-edit technique, a token-based distance technique, a semantic comparison technique, a lexical similarity technique, a language system, SNOMED-CT, LOINC, UMLS, a thesaurus, synonyms of the first and second labels, a string-matching technique, a Level2Jaro-Winkler hybrid distance technique and prevalence similarity.

[0060] In another embodiment of this aspect, the first unit of measurement and the second unit of measurement are compared in the step of comparing by using one from the group consisting of a widely accepted standard unit, a list of different spellings relating to the standard unit, a list of commonly used alternative units, a list of other units measuring the same quantity and a substitution of the standard unit for a missing unit. [0061] In another embodiment of this aspect, the one or more programs further include instructions for: displaying a graphical user interface comprising one or more fields for displaying information corresponding with one or more of the first database, the first element, the second database, the second element, the first aggregate attribute, the second aggregate attribute and the comparing of the first aggregate attribute and the second aggregate attribute.

[0062] In another embodiment of this aspect, the graphical user interface comprises one or more fields for displaying information corresponding with one or more of the first statistical measurement, the first label, the first unit of measurement, the second statistical measurement, the second label and the second unit of measurement.

[0063] In another embodiment of this aspect, the graphical user interface comprises one or more fields for displaying information corresponding with one or more of the first histogram, the second histogram and the statistical comparison.

[0064] In another embodiment of this aspect, the step of outputting the map comprises displaying a list of first elements or a list of second elements sorted by a statistical score while concurrently displaying one or more histograms based on the first and second aggregate attributes.

[0065] In another embodiment of this aspect, the graphical user interface is adapted to allow a user to select from a plurality of first databases, a plurality of first elements, a plurality of second databases, a plurality of second elements and a plurality of systems for comparing the first aggregate attribute and the second aggregate attribute.

[0066] In another embodiment of this aspect, the graphical user interface further comprises a scatterplot based on the first aggregate attribute and the second aggregate attribute.

[0067] In another embodiment of this aspect, the one or more programs further include instructions for calculating a match prediction score.

[0068] In another embodiment of this aspect, the one or more programs further include instructions for repeating steps 1-5 for every combination of elements from each database.

BRIEF DESCRIPTION OF THE DRAWINGS

[0069] The accompanying drawings, which are incorporated into this specification, illustrate one or more exemplary embodiments of the inventions disclosed herein and, together with the detailed description, serve to explain the principles and exemplary implementations of these inventions. One of skill in the art will understand that the drawings are illustrative only, and that what is depicted therein may be adapted based on the text of the specification and the spirit and scope of the teachings herein.

[0070] In the drawings, where like reference numerals refer to like reference in the specification:

[0071] FIG. 1A is a composite histogram showing matching data from an ambiguous sample labeled "Arterial BP" with known systolic blood pressure data, showing that the "Arterial BP" data represents arterial systolic blood pressure data;

[0072] FIG. IB is a composite histogram showing non-matching data from the ambiguous sample of FIG. 1A labeled "Arterial BP" with known systolic blood pressure data, showing that the "Arterial BP" data does not represent arterial systolic blood pressure, and thus must represent arterial diastolic blood pressure data;

[0073] FIG. 2 depicts an implementation of the software, method, and system of the invention as it relates to collaborative preparation of a manuscript;

[0074] FIG. 3 demonstrates an example of a Score Generation Framework with the various process modules; given two sets of Terms from sources P and Q, the framework computes Scores for every pair of Terms across the two sets;

[0075] FIG. 4 outlines an example of a process to calculate a similarity score for a term pair of name strings (p,q);

[0076] FIG. 5 illustrates an example of a Data Model according to the present invention, which includes relationships between data model objects used throughout the collaborative matching process, where FIG. 5A is an expanded view of the left side of the Data Model and FIG. 5B is an expanded view of the right side of the Data Model;

[0077] FIG. 6 illustrates an example of Software Architecture according to the present invention;

[0078] FIG. 7 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 7A is an expanded view of the left side of the screenshot and FIG. 7B is an expanded view of the right side of the screenshot;

[0079] FIG. 8 illustrates an example of information flow according to the present invention;

[0080] FIG. 9 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 9A is an expanded view of the left side of the screenshot and FIG. 9B is an expanded view of the right side of the screenshot;

[0081] FIG. 10A depicts a first example of output according to the present invention; [0082] FIG. 10B depicts a second example of output according to the present invention;

[0083] FIG. IOC depicts a third example of output according to the present invention;

[0084] FIG. 10D depicts a fourth example of output according to the present invention;

[0085] FIG. 10E depicts a fifth example of output according to the present invention;

[0086] FIG. 10F depicts a sixth example of output according to the present invention;

[0087] FIG. 10G depicts a seventh example of output according to the present invention;

[0088] FIG. 11 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 11A is an expanded view of the left side of the screenshot and FIG. 11B is an expanded view of the right side of the screenshot;

[0089] FIG. 12 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 12A is an expanded view of the left side of the screenshot and FIG. 12B is an expanded view of the right side of the screenshot;

[0090] FIG. 13 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 13A is an expanded view of the left side of the screenshot and FIG. 13B is an expanded view of the right side of the screenshot;

[0091] FIG. 14 illustrates an example of a flow diagram for a match prediction score system according to the present invention; and

[0092] FIG. 15 depicts a computer device or system according to the present invention comprising one or more processors and a memory storing one or more programs for execution by the one or more processors.

DETAILED DESCRIPTION

[0093] It should be understood that this invention is not limited to the particular methodology, protocols, etc., described herein and as such may vary. The terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention, which is defined solely by the claims.

[0094] As used herein and in the claims, the singular forms include the plural reference and vice versa unless the context clearly indicates otherwise. Other than in the operating examples, or where otherwise indicated, all numbers expressing quantities used herein should be understood as modified in all instances by the term "about." [0095] All publications identified are expressly incorporated herein by reference for the purpose of describing and disclosing, for example, the methodologies described in such publications that might be used in connection with the present invention. These publications are provided solely for their disclosure prior to the filing date of the present application. Nothing in this regard should be construed as an admission that the inventors are not entitled to antedate such disclosure by virtue of prior invention or for any other reason. All statements as to the date or representation as to the contents of these documents is based on the information available to the applicants and does not constitute any admission as to the correctness of the dates or contents of these documents.

[0096] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as those commonly understood to one of ordinary skill in the art to which this invention pertains. Although any known methods, devices, and materials may be used in the practice or testing of the invention, the methods, devices, and materials in this regard are described herein.

[0097] Some Selected Definitions

[0098] Unless stated otherwise, or implicit from context, the following terms and phrases include the meanings provided below. Unless explicitly stated otherwise, or apparent from context, the terms and phrases below do not exclude the meaning that the term or phrase has acquired in the art to which it pertains. The definitions are provided to aid in describing particular embodiments of the aspects described herein, and are not intended to limit the claimed invention, because the scope of the invention is limited only by the claims. Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular.

[0099] As used herein the term "comprising" or "comprises" is used in reference to compositions, methods, and respective component(s) thereof, that are essential to the invention, yet open to the inclusion of unspecified elements, whether essential or not.

[0100] As used herein the term "consisting essentially of refers to those elements required for a given embodiment. The term permits the presence of additional elements that do not materially affect the basic and novel or functional characteristic(s) of that embodiment of the invention.

[0101] The term "consisting of refers to compositions, methods, and respective components thereof as described herein, which are exclusive of any element not recited in that description of the embodiment. [0102] Other than in the operating examples, or where otherwise indicated, all numbers expressing quantities used herein should be understood as modified in all instances by the term "about." The term "about" when used in connection with percentages may mean ±1%.

[0103] The singular terms "a," "an," and "the" include plural referents unless context clearly indicates otherwise. Similarly, the word "or" is intended to include "and" unless the context clearly indicates otherwise. Thus for example, references to "the method" includes one or more methods, and/or steps of the type described herein and/or which will become apparent to those persons skilled in the art upon reading this disclosure and so forth.

[0104] Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of this disclosure, suitable methods and materials are described below. The term "comprises" means "includes." The abbreviation, "e.g." is derived from the Latin exempli gratia, and is used herein to indicate a non-limiting example. Thus, the abbreviation "e.g." is synonymous with the term "for example."

[0105] To begin with, the reader should understand certain concepts and terms used in the following detailed description.

[0106] As used herein, Observations measure occurrences of a real-world entity. These observations are stored in a data source and are recorded with an identifier (referred to herein as a name or label), a value, and optionally other data associated with the observation (such as units of measurement). Different sources often have different names for observations of the same entity.

[0107] As used herein, a Term is a collection of aggregate facts about observations of a single entity. This may include a name, units of measurement, a histogram of observed values, and summary statistics (such as mean and standard deviation). Two or more Terms may be deemed equivalent in accordance with a particular definition of an entity, called a data Element. A collection of Element definitions constitute a Dataset. All Terms that are equivalent under the definition of an Element can be combined into a single AggregateTerm representative of that Element.

[0108] A measure of similarity or dissimilarity between two Terms (or between a Term and an AggregateTerm) is called a Score. When a Term satisfies the definition of an Element, a Match may be inferred between the two. When a Match is approved, the Term may be merged with an existing Element; the result is that its AggregateTerm now includes facts from the merged Term. [0109] To further assist the reader in following the terminology used herein, the following Glossary is provided:

[0110] entity - a real-world, measurable concept

[0111] observation - the recording of an occurrence of an entity

[0112] source - the data storage of observations

[0113] name (also label) - the identifier of an observation

[0114] value - the recorded value of the observation

[0115] term - the collection of aggregate facts about observations of a single entity

[0116] element - a particular definition of an entity, used to establish equivalence among terms

[0117] dataset - a collection of elements

[0118] aggregateterm - a single term representing the aggregate facts of an element

[0119] score - a measure of similarity or dissimilarity between two terms

[0120] match - an inference that a term satisfies an element definition

[0121] approval - confirmation that a match is correct according to the element definition

[0122] merging - the operation of adding an approved term to an element's aggregateterm

[0123] seed - a dataset, its elements and their aggregateterms

[0124] review - a seed, along with matches and terms proposed for approval and merging

[0125] PI - the primary investigator, one who defines a dataset and its elements, distributes seeds, approves matches, and merges terms

[0126] collaborator - one who submits matches for approval and terms for merging in a review

[0127] To the extent not already indicated, it will be understood by those of ordinary skill in the art that any one of the various embodiments herein described and illustrated may be further modified to incorporate features shown in any of the other embodiments disclosed herein.

[0128] The following examples illustrate some embodiments and aspects of the invention. It will be apparent to those skilled in the relevant art that various modifications, additions, substitutions, and the like can be performed without altering the spirit or scope of the invention, and such modifications and variations are encompassed within the scope of the invention as defined in the claims which follow. The following examples do not in any way limit the invention.

[0129] The software of the invention, the method implemented by the software, and the system of software and hardware are now described. At times, the invention is described with reference to a primary investigator preparing a research article. However, the skilled artisan will understand that the invention is not limited to such activities.

[0130] When preparing a manuscript evaluating metadata from multiple studies, the primary investigator (PI) requires integrated data, usually from multiple institutions, for his research study. If the PI has collaborators, they too require integrated data to contribute to the manuscript. In the past, this endeavor has been extremely difficult and fraught with errors due to the lack of standardization of terminology and reporting procedures, and the lack of a standardized terminology among the PI and collaborators. The present invention solves the long- felt need in the art for standardization and/or compatibility. According to the present invention, the PI first creates a Dataset and defines its member Elements. A Collaborator is a participant in the Pi's multi-site study. The Collaborator has his own data sources to contribute to the study. The PI must share with his Collaborators a Seed, containing a Dataset, its Elements and their AggregateTerms. The Collaborator submits to the PI a Review, containing the original Seed, plus Matches proposed for approval and Terms proposed for merging.

[0131] Stated another way, in this initial set-up of the process, the PI needs an integrated dataset to complete work on a particular research study. He must first define each of the Elements required for the study and collect them in a Dataset. At this point, the PI has an initial Seed, version 0. He uploads this Seed to a web-based server and shares it with a Collaborator, who then imports/downloads it locally. The Collaborator creates Terms from observations in his local sources. He then generates Scores between his Terms and the AggregateTerms of the Seed. These Scores can aid him in the discovery of Matches between Terms and Elements. When finished finding Matches for as many Elements as he can, the Collaborator uploads a Review and shares it with the PI, who then imports it on his computer. The PI must approve all Matches before merging Terms into Elements. The process repeats with each Collaborator, continually enriching the Seed with more data, making future Scores more informative and future Matches easier to discover.

[0132] The process is outlined and diagrammed in FIG. 2. Note that in the initial case, the PI may assume the role of the first Collaborator. Steps 3, 4, 8, and 9 become unnecessary because the PI is simply sharing data with himself. The user icons next to steps 1, 2, 7, 10, and 11 indicate that human judgment and inference are required in the Element definitions, Match discovery and approval, and merging of Terms.

[0133] With regard to the process in FIG. 2, the labels set forth in Table 1, below, are to be applied:

[0134] Table 1 : Collaborative Matching Process

[0135] Software Architecture and Implementation

[0136] The software that enables this collaborative data-driven matching process is composed of four major components:

[0137] 1. An API to store and deliver data models of the process, enforcing model constraints and relationships and restricting mutable operations;

[0138] 2. A framework to generate Terms;

[0139] 3. A framework to generate Scores using various statistical and machine learning algorithms; and

[0140] 4. A web application to allow discovery and review of Matches.

[0141] Observational data can reside in a database provided by the collaborative institution. This data may come in a wide variety of storage formats, schema, or underlying technologies. The Term Generation Framework aggregates metadata about the observational data, including descriptive label, units of measure, mean, standard deviation, quartiles, minimum, maximum, number of samples, and histogram. These metadata are properties of Terms, which are stored via the API. An API provides storage, retrieval, and manipulation of data necessary to support all steps of the collaborative matching process. It maintains model consistency and validity and enforces relationship constraints. The Score Generation Framework quantifies the relative similarity between pairs of Terms. Scores are calculated based on various features of Terms, such as the distribution of values or the descriptive label. Scores are stored via the API. (See Scoring Algorithms below.) The Web Application provides an interactive way to search and browse through Terms to find and create likely Matches. (See Web Application below.) The Tag Map Framework updates the observational data with the mappings created in the Web Application. As a result, the observational data may now be queried using the nomenclature of the Elements of the research study or software application Data API.

[0142] FIG. 5 illustrates an example of a Data Model according to the present invention, which includes relationships between data model objects used throughout the collaborative matching process, where FIG. 5A is an expanded view of the left side of the Data Model and FIG. 5B is an expanded view of the right side of the Data Model. The exemplary Data Model includes relationships between Dataset, Element, Term, Source, Histogram, Summary, Match, Aggregate Term, Score and Algorithm. A Dataset defines many elements. Elements are represented by Terms and AggregateTerms, whose features describe a set of observations. AggregateTerms are comprised of multiple Terms. Terms have attributes that identify the Source from which it came, a Histogram of observed values, and Summary statistics about the observations. Scores quantify the comparison of a pair of Terms. A Score is defined by an Algorithm for computation. Matches indicate mappings between Elements and Terms.

[0143] FIG. 6 illustrates an example of Software Architecture according to the present invention. As shown in FIG. 6, the Software Architecture can include an Observational Data store that can be adapted to send data to a Term Generation Framework, which can be adapted to send data to an API that can be adapted to read and write to a Metadata store. A Score Generation Framework can be adapted to read/write from/to the aforementioned API to access data from the Metadata store. A Web Application can be adapted to read/write from/to the aforementioned API to access data from the Metadata store. The API can be adapted to send mapping data to a Tag Map Framework, which can be adapted to assign new terminology to the Observational Data store.

[0144] The Data API is implemented as the stateful manipulation of resources represented in JSON over the HTTP protocol. Table 2 shows the resource endpoints in the DATA API. Each row represents a resource provided by the API. The first column indicates the URL endpoint used to access the resource. The second column indicates the action performed for a POST request sent to that URL endpoint. The third column indicates what objects are returned for a GET request. The final column indicates what object is removed for a DELETE request. [0145] Table 2: The DATA API Elements

[0146] Term Generation Framework

[0147] The Term Generation Framework aggregates recorded observations and generates Term objects. After grouping observations by a "termid" field, the following facts are collected: name; units of measure; source; histogram of numeric values; histogram of non- numeric values; summary statistics of numeric values, including number of samples, maximum and minimum values, quartiles, and mean and standard deviation.

[0148] The Term Generation Framework has the following features: adaptable to a wide variety of source data formats; configurable fields that derive aggregate facts; leverages native database aggregation features (MongoDB, SQL); uses parallelism to achieve faster completion stores Terms via the Data API.

[0149] Score Generation Framework

[0150] The Score Generation Framework calculates Scores between pairs of Terms. It has the following features: can be used with algorithms implemented in any programming language; modular and service-oriented to minimize code repetition; and uses parallelism to achieve faster completion. FIG. 3 demonstrates the Score Generation Framework with the various process modules. Given two sets of Terms from sources P and Q, the framework computes Scores for every pair of Terms across the two sets.

[0151] In Step 1, the Manager module divides the score calculation into discrete units of work known as "jobs". Two types of jobs exist: cache jobs and score jobs. The Manager module can be adapted to receive information from a list of terms from source P and/or a list of terms of source Q.

[0152] In Step 2, cache jobs retrieve intermediate and reusable information about one or more Term, and score jobs calculate the score for one pair of Terms. Jobs may be written in any programming language. The Manager sets up dependencies between jobs (score jobs depend on certain cache jobs) and inserts all jobs into a job queue.

[0153] In Step 3, workers in the worker pool process jobs that have been put in the queue.

[0154] Multiple workers can simultaneously process different jobs. Failed jobs are automatically logged and retried. Job dependencies assure that each job has all the information required to proceed.

[0155] In Step 4, cache jobs store information about each Term in an in-memory cache (Memory Store for Cache), where it can be retrieved by score jobs. Score jobs publish calculated scores to an in-memory store (Memory Store for Scores).

[0156] In Step 5, the Inserter module accumulates scores as they are published in the in-memory store (Memory Store for Scores). The Inserter inserts those scores, batches at a time, into a database on disk for persistent storage.

[0157] Time is saved by buffering and inserting in batches rather than individually.

[0158] Scoring Algorithms

[0159] Scoring algorithms are extremely valuable to the discovery of Matches. A well-designed scoring algorithm can recommend Terms with a high probability of being matched, reducing time and effort spent searching for similar Terms.

[0160] There are three main features to consider when comparing Terms: the distribution of values; the name; and the units of measure. The Score Generation Framework includes, but is not limited to, several algorithms grounded in statistics, machine learning, information theory, and information retrieval. [0161] Distribution Comparisons

[0162] Earth Mover's Distance (EMD)

[0163] EMD is a measure of the distance between two probability distributions over a region D. In mathematics, this is known as the Wasserstein metric. Informally, if the distributions are interpreted as two different ways of piling up a certain amount of dirt over the region D, the EMD is the minimum cost of turning one pile into the other; where the cost is assumed to be amount of dirt moved times the distance by which it is moved. A perfect EMD score for identical histograms should be zero.

[0164] A histogram can be used to compare a distribution of observed values, provide a visual representation of the data and statistics related to the same. A common label can be used to accommodate for difference in spelling, abbreviations and synonyms. The common label can be generated using Jaro-Winkler and/or UMLS Metathesaurus. Units can be identified using simple matching or more advanced methods such as those used for the labels.

[0165] The following is a simplification of the EMD calculation for two histograms, A and B, where A, and B_z are the i-th bins in the two histograms:

[0166] Kullback-Leibler Divergence (KLD)

[0167] KLD is a non- symmetric measure of the difference between two probability distributions P and Q. Specifically, the Kullback-Leibler divergence of Q from P, denoted D_KL(P| |Q), is a measure of the information lost when Q is used to approximate P. It measures the expected number of extra bits required to code samples from P when using a code based on Q, rather than using a code based on P. A perfect KLD score for identical histograms should be zero. [0168] Given that p(x) is the estimated probability that value x was observed for Term P, the KLD can be calculated as:

for all observed values _?,

! where p(x) = C_¾ / N_w and C_¾ is the count for the bin containing x, N is the total number of samples in P, and w is the bin width of the histogram.

[0169] Name Comparisons

[0170] Because Terms can have similar names associated with them, they can be matched using string-edit or token-based distance techniques. Similarity functions syntactically compare a pair of strings s and t and assign a real number r as a similarity score. A larger value of r reflects great similarity between s and t.

[0171] A semantic comparison between a pair can also be done by obtaining synonyms of string s and using a similarity function to get the best score between the synonyms of s and string r.

[0172] The Unified Medical Language System (UMLS) is a comprehensive repository of over 60 biomedical vocabularies developed by the US National Library of Medicine. The system includes a Metathesaurus of source vocabularies and a UMLS API with a feature to query for biomedical terms and their relationships in the ontologies. The API is queried for possible synonyms from the Metathesaurus for each Term in a dataset and the synonyms are in turn utilized for semantic comparison to get a similarity score between term names.

[0173] The SecondString project, developed at Carnegie Mellon University, is a Java toolkit of string-matching techniques that includes hybrid similarity functions that can be used for a semantic comparison. The present invention, in its Score Generation Framework, contains a semantic name matching technique based on unique biomedical concepts made available by UMLS API.

[0174] Level 2 Jaro-Winkler (L2JW) with UMLS

[0175] The algorithm leverages SecondString 's implementation of the Level2Jaro- Winkler hybrid distance technique and the synonym results returned by the UMLS API. The flowchart in FIG. 4 outlines the process to calculate a similarity score for a term pair (p,q), where p and q are the name strings. [0176] The process to calculate a similarity score for a term pair can include a first step including the following steps: Use UMLS API to find synonyms for p and q; pSynonyms = Query UMLS Metathesaurus with search string p; and qSynonyms = Query UMLS Metathesaurus with search string q.

[0177] The process to calculate a similarity score for a term pair can include a second step including the following steps: Create synonym lists for p and q; pSynonymList = pSynonyms + p; and qSynonyrimList = qSynonyms + q.

[0178] The process to calculate a similarity score for a term pair can include a third step including the following steps: Compute Level2Jaro Winkler similarity scores between p string and strings in qSynonymList; Select the highest score from the pair comparisons; pMax = 0 for string in gSynonymList; score = Level2JaroVVinkler.compute(p, string); pMax = Max(score, pMax); and end.

[0179] The process to calculate a similarity score for a term pair can include a fourth step including the following steps: Compute Level2Jaro Winkler similarity scores between q string and strings in pSynonymList; Select the highest score from the pair comparisons; qMax = 0 for string in pSynonymList; score = Level2JaroWinkler.compute(q,string); qMax = Max(score, qMax); and end.

[0180] The process to calculate a similarity score for a term pair can include a fifth step including the following steps: The Average of the highest scores is the similarity score for the term name pair (p, q); and similarity Score = (pMax + qMax) / 2.0.

[0181] Level 2 Jaro- Winkler is defined as follows: To compare two long strings s and t, the strings are first broken down to substrings s = ai...a_K and t = bi...b_L. The similarity is defined by

where sim ' is the Jaro-Winkler distance function.

[0182] Combining L2JW with the UMLS Metathesaurus incorporates both lexical and semantic similarity when comparing two strings. Given two strings p and q, we obtain the set of synonyms from the UMLS Metathesaurus, pList and qList. We define pMax as the maximum Level2Jaro Winkler score between p and the strings in (q + qList), and qMax as the max score between q and the strings in (p + pList). The final score is the average of pMax and qMax.

[0183] Units of Measure Comparison

[0184] Terms that observe the same entity will usually have the same units of measure. This indicates a perfect match with a score of 1. For a particular entity, compatible units of measure may include (in decreasing order similarity): one widely accepted "standard" unit for that particular entity; several different spellings of the standard unit; several commonly used alternative units; any other unit measuring the same quantity (time, length, mass, volume, etc.); and missing units, possibly assuming the standard unit. Using this information, a similarity score between units of measure is computed.

[0185] Composite Comparison (see, also, the "Match Prediction Score" section below)

[0186] The aforementioned scores each measure the pairwise similarity (or dissimilarity) of a single feature: distribution, name, or units. A composite score that combines these individual feature scores can give a more comprehensive representation of similarity. Although any number of methods can be used, two exemplary methods of combining feature scores are disclosed herein: Linear Combination and Naive Bayes.

[0187] Linear Combination

[0188] The distribution score D, name score N, and units score U can be combined as a weighted sum, i.e., composite score = W_DD + WNN + WJJU. This method is simple to implement and comprehend, and is useful as a first approach.

[0189] Naive Bayes

[0190] A more sophisticated method of combining individual scores is to use a Naive Bayes probabilistic model, where the features are the individual scores D, N, and U and the task is classifying a pair of terms as a match or non-match. There are two important benefits of using this approach: 1) the model outputs a composite score (between 0 and 1) that has a meaningful interpretation - the probability that a pair of terms is a match; and 2) the model improves with more training data, so as more institutions contribute data, the model becomes more accurate.

[0191] In a nutshell, a Naive Bayes model first "learns" to associate certain values of score D, N, and U with matching terms, and other values with non-matching terms. When presented with a new set of score values, it estimates how likely these scores came from a matching pair of terms, having previously learned what range of values to expect. [0192] Sometimes values in the histograms are discovered to be incorrect. Terms can be regenerated to allow conversion of units and customized parsing of values.

[0193] Web Application

[0194] The Web Application ("the app") has two modes of operation: discovery of Matches, used by the Collaborator, and review of Matches, used by the PI.

[0195] The Web Application operates in two modes depending on the role of the user: Collaborator mode (an example of which is illustrated in FIG. 7) and Principal Investigator mode. It is designed to run independently on the Collaborator's computer and the Principal Investigator's computer. The majority of usage occurs in Collaborator mode. Users start by uploading a Seed, selecting a Dataset of Target Elements, and selecting a set of Terms. The Collaborator selects an Element and can begin searching for matching Terms. The discovery of matching Terms is greatly facilitated by the ability to sort Terms by their Scores. The user may also sort by name and search by partial strings or regular expressions. The list of Terms instantly updates with the results of querying and sorting. The histograms of the current selections are displayed for quick inspection of distribution similarity. Statistics are also available for review as well. Matches created between Terms and Elements are displayed at the top and can be reviewed or deleted. Finally, the Collaborator can download a Review. In Principal Investigator Mode, the PI first uploads a Review and can then select a Dataset of Elements. The list of Matches is displayed. The PI clicks a Match to inspect the histograms and statistics to verify similarity. After approving the Matches he agrees with, the PI performs a merge. This results in a new Dataset with new Elements that contain aggregate data from the matched Terms.

[0196] FIG. 7 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 7A is an expanded view of the left side of the screenshot and FIG. 7B is an expanded view of the right side of the screenshot.

[0197] The Web Application can comprise a graphical user interface (GUI) 700 such as that shown, for example, in FIG. 7. The GUI 700 can comprise one or more of the following: a mode field 702 (in the example shown in FIG. 7, the GUI is in Collaborator Mode, and the GUI can also be provided in Principal Investigator Mode (not shown)); a Target Elements field 704; a Target Elements count field 706 ("73 items" in this example); a Matches button 708; a Target Elements database name field 710 ("VPS Knode" in this example); a button 712 for selecting one database from a list of databases; a download button 714; an upload button 716; a drop down menu 718 including a list of Target Elements ("Glucose Whole Blood - (mg/dL)" in this example); a Source Terms field 720; a Source Terms count field 722 ("906 items" in this example); an Unmatch button 724; a Source Terms database name field 726 ("ism_new_chart_events" in this example); a search field 728 (with the text "glu" in the field in this example); a Source Terms count and sort indicator field 730 ("8 of 906 shown, sorted by EMD" in this example, which indicates that 8 of 906 Source Terms are displayed in the field below, which are sorted by Earth Mover's Distance (EMD)); a Source Terms display field 732 (in this example, "Glucose (POC) ₍...)", "Glucose (Lab-POC) _(nuii)", "Glucose _(nuii)", "CSF Glucose (null)", "Urine Glucose (_nuii)", "BodyFluid Glucose (_nuii)", "Glucose BF (_nuu)" and "UA Glucose (_nuii)" are displayed); a numeric field 734 (displaying the EMD for each of the displayed Source Terms in field 732; in this example, the values of 2.603, 6.882, 10.94, 15.95, 17.66, 28.42, 40.16 and 1.798e+308 are displayed); a Delete All button 736; a Target Element header field 738; a Source Term header field 740; a Delete Target Element button 742; a Target Element field 744 ("Non- Invasive Systolic Blood Pressure", "Arterial Systolic Blood Pressure", "Respiratory Rate", "Glucose Whole Blood", "Glucose Serum", "Ionized Calcium" and "Total Calcium" are shown in this example); a Source Term field 746 (the Target Elements of "Non-Invasive Systolic Blood Pressure", "Arterial Systolic Blood Pressure", "Respiratory Rate", "Glucose Whole Blood", "Glucose Serum", "Ionized Calcium" and "Total Calcium" correspond with the Source Terms "NIBP (mmHg)", "Arterial BP (mmHg)", "Resp. Rate (bpm)", "Glucose (POC)", "Glucose", "Calcium Ionized" and "Calcium Total", respectively, which are displayed in this example); a source field 748 ("source: ism new chart events" in this example); a visual representation tab 750 ("Histogram" in this example); a Statistics tab 752; a visual representation of data 754, such as one or more histograms; a Target Element field 756 ("Glucose Whole Blood" in this example); a numeric count of the Target Element 758 ("n = 12893" in this example); a first key 760 (which can be color-coded); a Source Term field 762 ("Glucose (POC)" in this example); a numeric count of the Source Term 764 ("n = 43420" in this example); and a second key 766 (which can be color-coded).

[0198] In this example, the visual representation of data 754 can comprise a histogram where the x-axis is a value of a glucose measurement, the relatively darker histogram can correspond with the Target Element of "Glucose Whole Blood - (mg/dL)" and the relatively lighter histogram can correspond with the Source Term "Glucose (POC) ₍...₎". As such, for example, the peak of the histogram for the Target Element of "Glucose Whole Blood - (mg/dL)" is a glucose of 90 mg/dL. The y-axis is a density of the distribution for a given value along the y-axis. Assuming the width of the bar is 1 mg/dL, (0.016 * 1) = 0.016, which is the fraction of observations that were 90 mg/dL. So, in this example, about 1.6% of all the "Glucose Whole Blood - (mg/dL)" measurements in the entire "VPS Knode" database had a value of 90 mg/dL. If the entire x-axis is displayed, unless there is missing data, the sum of the areas of the bars should equal 1 (100%).

[0199] FIG. 8 illustrates an example of information flow 800 according to the present invention. Specifically, for example, a first database 810 can be adapted to send a first dataset 820 to a data module 830, which can be adapted to output a second dataset 840, which can be added to an aggregate database 850, which can be adapted to send information back into the data module 830. The first database 810 can be an EMR database from a hospital. The first dataset 820 can include a label, data, a histogram and a unit. The second dataset 840 can include a plurality of Labels A, B, C, D ... and a corresponding plurality of Target Labels A, B, C, D .... The aggregate database 850 can include aggregate data, a label, data (including a histogram) and a unit. The second dataset 840 can comprise a mapping file between an element's label in a hospital EMR such as first dataset 810 and its label in a target store, repository, research request, ontology or the like. The hospital or a customer can use the mapping to build a custom query to extract data and construct the data in a usable way.

[0200] FIG. 9 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 9A is an expanded view of the left side of the screenshot and FIG. 9B is an expanded view of the right side of the screenshot. In overall structure and function, the screenshot shown in FIG. 9 is similar to that shown in FIG. 7.

[0201] The Web Application can comprise a graphical user interface (GUI) 900 such as that shown, for example, in FIG. 9. The GUI 900 can comprise one or more of the following: a mode field (not shown, but similar to mode field 702 in FIG. 7); a Target Elements field 904; a Target Elements count field 906 ("47 items" in this example); a Matches button 908; a Target Elements database name field 910 ("CHLA Cerner" in this example); a button 912 for selecting one database from a list of databases; a download button 914; an upload button 916; a drop down menu 918 including a list of Target Elements ("Base Excess Arterial - (MEq/L)" in this example); a Source Terms field 920; a Source Terms count field 922 ("906 items" in this example); an Unmatch button 924; a Source Terms database name field 926 ("ism_new_chart_events" in this example); a search field 928 (with the default text "Search for terms..." in the field in this example); a Source Terms count and sort indicator field 930 ("906 of 906 shown, sorted by EMD" in this example, which indicates that 906 of 906 Source Terms are displayed in the field below, which are sorted by Earth Mover's Distance (EMD)); a Source Terms display field 932 (in this example, "BEmv (mEq/1) (_mE_q/i)", "BE (post) (mEq/1) (_nuii)", "BEv (mEq/1) (mEq/i)", "BEa (mEq/1) (mE_q/i)", "BE (art. monitor) (mE_q/i)", "CSF Band % _(null)", "Creatinine BF (_nuii)", "CSF Bands _(nuii)", "Vitamin E Beta-Gamma _(nuU)", "BE (pre) (mEq/1) _(nuU)", "CMV IGG (null)", "BE (BG)mEq/l. (_mE_q/i)", "Gent Level (_nuii)", "Lithium Level (_nuii)", "Herpes Simplex (null)", "UA Volume _(null)", "Chromium P _(null)", "Gentamicin _(null)", "Tobra Level _(null)", "Tobramycin _(nuU)", "LP # of attempts ₍ ", "BodyFluid Bands _(nuU)", "RSBI . (...)", "Age _(days)" and "BF Band % (_nuii)" are displayed); a numeric field 934 (displaying the EMD for each of the displayed Source Terms in field 932; in this example, the values of 0.5506, 1 .255 , 1 .854, 1 .875 , 2.910, 3.440, 3.656, 3.907, 3.937, 3.953 , 4.009, 4. 1 16, 4.261 , 4.280, 4.330, 4.344, 4.357, 4.485 , 4.540, 4.63 1 , 4.837, 4.901 , 4.941 , 5.006 and 5. 193 are displayed); a Delete All button 936; a Target Element header field 938; a Source Term header field 940; a Delete Target Element button 942; a Target Element field 944 ("PTT", "Albumin", "Aspartate Transaminase" and "Base Excess Arterial" are shown in this example); a Source Term field 946 (the Target Elements of "PTT", "Albumin", "Aspartate Transaminase" and "Base Excess Arterial" correspond with the Source Terms "PTT", "Albumin", "Plasma Hgb" and "BEmv (mEq/1)", respectively, which are displayed in this example); a source field 948 ("source: ism new chart events" in this example); a visual representation tab 950 ("Histogram" in this example); a Statistics tab 952; a visual representation of data 954, such as one or more histograms; a Target Element field 956 ("Base Excess Arterial" in this example); a numeric count of the Target Element 958 ("n = 6366" in this example); a first key 960 (which can be color-coded); a Source Term field 962 ("BEmv (mEq/1)" in this example); a numeric count of the Source Term 964 ("n = 2323" in this example); and a second key 966 (which can be color- coded).

[0202] The present invention includes additional uses including further data visualization tools. The present invention can include automation of data input into repositories, can support applications, can enable sharing of datasets and can provide a reproducible dataset for research from multiple sources.

[0203] FIG. 10A depicts a first example of output according to the present invention. Specifically, FIG. 10A is an example of a pie chart of "Mortality" with the percentage of "lived" and "died" plotted therein.

[0204] FIG. 10B depicts a second example of output according to the present invention. Specifically, FIG. 10B is an example of a pie chart of "Race" with the percentage of "other", "latino", "white", "black", "unknown", "pacific", "islander", "asian" and "japanese" plotted therein. [0205] FIG. IOC depicts a third example of output according to the present invention. Specifically, FIG. IOC is an example of a pie chart of "Gender" with the percentage of "male" and "female" plotted therein.

[0206] FIG. 10D depicts a fourth example of output according to the present invention. Specifically, FIG. 10D is an example of a plot of "Length of Stay (in Days)" with "Days" along the x-axis (labeled from 0 to 200 days in increments of 100 days) and "Number of Patients" along the y-axis (labeled from 0 to 25 patients in increments of 5 patients).

[0207] FIG. 10E depicts a fifth example of output according to the present invention. Specifically, FIG. 10E is an example of a plot of "Number of Admits by Month" with "Month" along the x-axis (labeled for each of the 12 months) and "Number of Admits" along the y-axis (labeled from 0 to 40 patients in increments of 10 patients).

[0208] FIG. 10F depicts a sixth example of output according to the present invention. Specifically, FIG. 10F is an example of a plot of "Number of Admits by Year" with "Year" along the x-axis (labeled for each of the years from 2008 to 2013 in increments of 1 year) and "Number of Admits" along the y-axis (labeled from 0 to 150 patients in increments of 25 patients).

[0209] FIG. 10G depicts a seventh example of output according to the present invention. Specifically, FIG. 10G is an example of a plot of "Top Ten Diagnoses" with the names of the Top Diagnoses along the x-axis (i.e., "SCOLIOSIS", "TUMOR - Cerebral", "RESPIRATORY DISTRESS - Other", "RESPIRATORY FAILURE", "DEVELOPMENTAL DELAY", "SEIZURE DISORDER", "GENETIC ABNORMALITY", "TRAUMA - Head", "SHOCK - Septic" and "SEPSIS - Other") and "Number of Patients" along the y-axis (labeled from 0 to 30 patients in increments of 10 patients).

[0210] FIG. 11 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 11A is an expanded view of the left side of the screenshot and FIG. 11B is an expanded view of the right side of the screenshot. In overall structure and function, the screenshot shown in FIG. 11 is similar to that shown in FIGS. 7 and 9.

[0211] The Web Application can comprise a graphical user interface (GUI) 1100 such as that shown, for example, in FIG. 11. The GUI 1100 can comprise one or more of the following: a mode field (not shown, but similar to mode field 702 in FIG. 7); a Target Elements field 1104; a Target Elements count field 1106 ("100 items" in this example); a Info Window button 1108; a Target Elements database name field 1110 (default text "select a dataset" in this example); a button 1112 for selecting one database from a list of databases; a download button (not shown, but similar to download button 714); an upload button (not shown, but similar to upload button 716); a search field 1118 (default text "search an element" in this example); a Source Terms field 1120; a Source Terms count field 1122 ("130 items" in this example); an Matches button 1124 with a sort indicator; a Source Terms database name field 1126 (default text "Select a source" in this example); a search field 1128 (with the default text "Search for terms" in the field in this example); a Source Terms count and sort indicator field (not shown, but similar to Source Terms count and sort indicator field 730); a Source Terms display field 1132 (in this example, "1. element One", "2. element Two", "3. element Three", "4. element" ... "29. element" are displayed); a numeric field (not shown, but similar to numeric field 734); a vertical slider 1135; a visual representation tab 1150 ("Histogram" in this example); a Statistics tab 1152; a visual representation of data 1154, such as one or more histograms; a Comparison View tab 1170; a Statistics tab 1172; a matches tab 1174; a first label field 1176 ("EMD - Histogram" in this example); an element field 1178 (in this example, "element One", "element Two", "element Three", "element" ... "element" are displayed); a vertical slider 1180; a second label field 1182 ("L2JW - Label" in this example); an element field 1184 (in this example, "element One", "element Two", "element Three", "element" ... "element" are displayed); a vertical slider 1186; a third label field 1188 ("Matches" in this example); a Target Element field 1190 (in this example, a header "Target Element" and Target Elements "Hemog", "WBC" and "HR" are displayed); a Source Term field 1192 (in this example, a header "Source Term" and Source Terms "Hemoglobin", "WBC" and "Pulse" are displayed, which correspond with Target Elements "Hemog", "WBC" and "HR", respectively); a vertical slider 1194; a minimize/maximize button 1196 (shown in "minimize" mode in this example); a horizontal slider 1198; and a resize button 1199.

[0212] FIG. 12 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 12A is an expanded view of the left side of the screenshot and FIG. 12B is an expanded view of the right side of the screenshot. FIG. 13 illustrates an example of a screenshot of a Web Application according to the present invention, where FIG. 13A is an expanded view of the left side of the screenshot and FIG. 13B is an expanded view of the right side of the screenshot. FIG. 12 and FIG. 13 can be from the same embodiment of the present invention. For example, in FIG. 12, the Similarity Plot tab 1251 and the Histogram tab 1250 are selected; whereas, in FIG. 13, the Matches List tab 1253 and the Statistics tab 1252 are selected. [0213] The Web Application can comprise a graphical user interface (GUI) 1200 such as that shown, for example, in FIG. 12 and in FIG. 13. The GUI 1200 can comprise one or more of the following: a mode field 1202 (in the example shown in FIG. 12, the GUI is in Collaborator Mode, and the GUI can also be provided in Principal Investigator Mode (not shown)); a Target Elements field 1204; a Target Elements count field 1206 ("47 items" in this example); a Matches button (not shown, but similar to Matches button 708 in FIG. 7); a Target Elements database name field 1210 ("CHLA Cerner" in this example); a button 1212 for selecting one database from a list of databases; a download button 1214; an upload button 1216; a drop down menu 1218 including a list of Target Elements ("Albumin - (g/dL)" in this example); a Source Terms field 1220; a Source Terms count field 1222 ("906 items" in this example); an Unmatch button 1224; a Source Terms database name field 1226 ("ism_new_chart_events" in this example); a search field 1228 (with the default text "Search for terms. .." in the field in this example); a Source Terms count and sort indicator field 1230 ("906 of 906 shown, sorted by PRED" in this example, which indicates that 906 of 906 Source Terms are displayed in the field below, which are sorted by an algorithm such as PRED, which is described in greater detail below with reference, for example, to FIG. 14); a Source Terms display field 1232 (in this example, "Albumin _(nuii₎", "Normal Saline (ml) ₍...₎", "ServoPressure (cmH20) "Potassium (BG) (_nuii)", "Myelocyte% M (_nuii)", "Gastric pH (_nuu)' ', "Crs (LunglnSc). ₍ ", "Myelocytes _(nuU)", "# of Units-PRBC's ₍...₎", "Perfusion ₍ ", "I:E Ratio (l :xx) ₍...₎", "CSF Band % _(nuii₎", "Metamyelocytes _(nuU)", "CI (I/min/M2). ₍...₎", "Today's Weight (kg) _(kg)", "AST (SGOT) _(nuii₎", "NG/GT Residual (cc) _(nuii₎", "ALT (SGPT) _(nuU)", "ETT Moved Down (cm) (...)", "Calcium Total (_nuii)", "# pillow orthopnea (_nuii)", "Osmolality (Serum) (_nuii)", "Cap Refill (sec) _(sec)", "ServoPressure(cmH20) _(cmH20)", "ExpiredMinuteVol (1) ₍.. ", "TotalProtein (Serum) _(nuii₎" and "ST segment: ST1 ₍...₎" are displayed); a numeric field 1234 (displaying the PRED for each of the displayed Source Terms in field 1232; in this example, the values of 1.000, 0.2504. 0.005716, 0.004425, 0.002888, 0.002432, 0.002379, 0.001981 , 0.001323, 0.0006291 , 0.0006035, 0.0005570, 0.0004333, 0.0002849, 0.0002656, 0.0002570, 0.0001813, 0.0001746, 0.0001552, 0.0001 1 19, 0.0001064, 0.00008453, 0.00006810, 0.00005421 , 0.00004818, 0.00004368 and 0.00004178 are displayed); a visual representation tab 1250, which is selected in FIG. 12 ("Histogram" in this example); a Similarity Plot tab 1251, which is selected in FIG. 12; a Statistics tab 1252, which is selected in FIG. 13; a Matches List tab 1253, which is selected in FIG. 13. [0214] For example, when the Similarity Plot tab 1251 is selected, shown in FIG. 12, one or more of the following can be displayed: a visual representation of data 1255, such as one or more scatterplots; a radio button 1257 adapted to allow display of "Source Only" information; a filter field 1259; and a Load more button 1261.

[0215] Scatterplot Navigation

[0216] The Web Application UI can include a navigation feature, which can be called a scatterplot. When the user selects a target element, an interactive scatterplot can be generated that allows the user to find source terms that are most similar to the selected target element. Each dot in the plot can represent a source term. The dot's position can be determined using two different scores as coordinates on each axis. The dot's radius can be determined by the number of observations for that source term. A perfect match between the source term and target element would render the dot at the origin. The user can quickly determine which source terms are similar to the selected target element by visually inspecting distances from the origin. A larger radius can indicate a source term that is measured often and is more likely to be a useful or interesting term. The scatterplot can allow visual inspection of, for example, three variables simultaneously (two scores that determine position and the number of observations that determines size) to choose a match among the best candidates.

[0217] Other UI features improve the readability of the scatterplot. Each dot can be labeled with the term's name. The dots can be colored according to distance from the origin. In addition, arcs can be drawn to indicate relative distance from the origin (like elevation lines on a contour map). The plot can be zoomable so the user can better view clustered dots. The user can type in a search string and dots that do not match can fade away. When the user clicks a dot, the dot can be highlighted and the corresponding source term can be selected, and the histograms and statistics can be rendered in the main UI frame.

[0218] In various embodiments, the EMD and L2JW scores can be used to determine x-y position. Any two scores can be used, as long as they are transformed such that the "most similar" score is positioned at 0. The following transformations can be used for EMD and L2JW:

[0219] x = logl0(EMD * 10)

[0220] y = (1 - L2JW) * 5

[0221] Note that when EMD = 0.1, x = 0; and when L2JW = 1, y = 0. Also note that an L2JW of 0 is roughly as far as an EMD of 10⁴.

[0222] In various embodiments, the dot radius can be proportional to the log of the number of observations. [0223] As shown, for example, in FIG. 12, the x-axis of the scatter plot can be EMD on a logarithmic scale, in this example, the scale is from 0.1 to 100000 divided logarithmically (i.e., 0.1, 1, 10, 100, 1000, 10000 and 100000), and the y-axis of the scatter plot can be L2JW on a linear scale, in this example, the scale is from 1.0 to 0.0 with equal increments of 0.1 therebetween.

[0224] For example, when the Matches List tab 1253 is selected, shown in FIG. 13, one or more of the following can be displayed: a Delete All button 1236; a Target Element header field 1238; a Source Term header field 1240; a Delete Target Element button 1242; a Target Element field 1244 ("Albumin", "Aspartate Transaminase", "Base Excess Arterial", "Base Excell Capillary", "Base Excess Venous", "Bicarbonate", "Bilirubin", "Blood Urea Nitrogen", "C-reactive protein", "Creatinine" and "Diastolic Blood Pressure" are shown in this example); a Source Term field 1246 (the Target Elements of "Albumin", "Aspartate Transaminase", "Base Excess Arterial", "Base Excell Capillary", "Base Excess Venous", "Bicarbonate", "Bilirubin", "Blood Urea Nitrogen", "C-reactive protein", "Creatinine" and "Diastolic Blood Pressure" correspond with the Source Terms "Albumin", "AST (SGOT)", "Bea (mEq/1)", "BEc (mEq/1)", "Bev (mEq/1)", "Total C02 (meas)", "Bilirubin. Total", "BUN", "C- Reactive Protein", "Creatinine" and "NIBP (mmHg)", respectively, which are displayed in this example); a source field 1248 ("source: ism new chart events" in this example).

[0225] For example, when the visual representation tab 1250 is selected, shown in FIG. 12, one or more of the following can be displayed: a visual representation of data 1254, such as one or more histograms; a Target Element field 1256 ("Albumin" in this example); a numeric count of the Target Element 1258 ("n = 4039" in this example); a first key 1260 (which can be color-coded); a Source Term field 1262 ("Albumin" in this example); a numeric count of the Source Term 1264 ("n = 52090" in this example); and a second key 1266 (which can be color-coded).

[0226] For example, when the Statistics tab 1252 is selected, shown in FIG. 13, one or more of the following can be displayed: a Target Element field 1256 ("Albumin" in this example); a first key 1260 (which can be color-coded); a Source Term field 1262 ("Albumin" in this example); a second key 1266 (which can be color-coded); a View Scores button 1265; a first statistical information field 1267 corresponding with the Target Element; and a second statistical information field 1269 corresponding with the Source Term. In this example, the Number of samples, Minimum, Maximum, Mean, Standard Deviation, 1^st Quartile, 2^nd Quartile and 3^rd Quartile are displayed side-by-side for the Target Element and the Source Term with appropriate descriptive labels between the first and second statistical information fields 1267 and 1269. Any other suitable display of statistical information may be provided.

[0227] Each of the GUIs 700, 900, 1100 and 1200 can be adapted so that, as a user selects a specific Target Element and/or a specific Source Term, and/or between different databases, and/or between different means of predicting the likelihood of a match (such as EMD or PRED), on the left side of the GUI, the resulting information (such as histograms and/or scatterplots and/or statistical information and the like) displayed on the right side of the GUI updates accordingly. In essence, there are various means to search, browse, select databases, target elements, and source terms. There are GUI elements that aid such actions: selectable dropdowns, selectable/sortable/filterable lists, and dynamic visualizations that indicate data features and relationships. As such, the GUIs 700, 900, 1100 and 1200 allow a user to quickly compare information relating to different combinations of Target Elements and Source Terms, which aids the user in making an informed decision as to what constitutes an appropriate match between Target Elements and Source Terms. These GUI elements make it easier to discover the best match and all other appropriate matches that may not have been apparent without the use of this invention. Although the present examples involve Target Elements and Source Terms, any two types of data can be compared in this manner.

[0228] Match Prediction Score

[0229] In various embodiments, machine learning algorithms can be used to help determine whether two terms from different sources should be matched. Pairs of terms have several features that may influence a decision to declare this pair a match. For each pair, a vector of features can be generated. Some of these pairs can be manually identified as matches, but the vast majority of pairs are non-matches. According to the present invention, a machine learning classifier can be trained on all of these feature vectors so that the machine learning classifier learns which ones are matches and which are not. Then the classifier can be used on any new/unseen feature vectors to predict whether it is a match or not, and to provide the probability of being a match. This probability can also be interpreted as another kind of Score between the term pair.

[0230] The raw observational data (e.g., Heart Rate of 60bpm at 12pm) and computed aggregate characteristics (e.g., histogram, average, std dev, quartiles, label, units, number of observations) can be collected. The aggregate characteristics (which can be called "Terms") can be compared with other aggregate characteristics in various ways. Some of these comparisons can produce Scores (e.g., EMD score for histogram, L2JW score for lexical and semantic similarity). Scores can be taken, plus other types of data, and feature vectors can be created that can represent the pair of Terms. From the feature vectors, a classifier can be trained. Finally, the classifier is used to predict whether an unseen feature vector is a Match or not.

[0231] In other words, Observational information can be used to generate a Term (Aggregate data), which can be used to generate Features (such as a Term Pair), which can be used to generate a Classifier, which can be used to determine Match Probability.

[0232] FIG. 14 illustrates an example of a flow diagram for a match prediction score system 1400 according to an embodiment of the present invention. In FIG. 14, for example, each of modules 1403, 1406, 1409, 1423, 1426, 1427, 1429, 1443, 1449, 1456, 1459, 1463, 1476 and 1479 can be a form of data such as a database or dataset. Each of modules 1413, 1416 and 1419 can be a Term Generation component. Each of modules 1433 and 1439 can be a Score Generation component. Module 1453 can be a Web UI component, which can include a manual process. The system 1400 can include a Feature Generation component. Module 1466 can be a Training component. Module 1469 can be a Prediction component. The arrows indicate examples of the flow of data. Each of modules 1423, 1426, 1427, 1429, 1443, 1449, 1456, 1459, 1463 and 1479 can be served by the API (examples of which are described in greater detail above). Each of modules 1413, 1416, 1419, 1433, 1439, 1466 and 1469 can be executed by a developer. In each of modules 1413, 1416, 1419, 1423, 1426, 1427, 1429, 1433, 1439, 1443, 1449, 1453, 1456, 1459, 1463, 1466, 1469, 1476 and 1479 information contained in parentheses represent underlying data format or implementation.

[0233] Data source A can be used as an initial set of terms. Suppose the process of matching terms from data source B to the terms in the present study is complete (meaning Terms, Scores and Matches are generated for the B-to-A mapping). This information can next be used to predict matches in a new data source C.

[0234] AB Features

[0235] Given term a from source A and term b from source B, when comparing the pair of terms a and b, several features can indicate different aspects of their similarity. Given the vector of feature values, the pair of a and b can be classified as either Match or Non-Match. Or, the probability that a and b are a Match can be calculated.

[0236] The features can include one or more of the following: (1) Histogram similarity (how similar are the distributions of observed values?) including, for example, Earth Mover's Distance EMD ( a , b ) , Kolmogorov-Smirnoff 2-sample test, and Anderson-Darling 2- sample test; (2) Semantic and lexical similarity (how similar are the names in meaning and appearance?)(it is noted that the lexical similarity and semantic similarity can be separated into two individual features) including, for example, Level2 Jaro-Winkler/UMLS score L2JW (a, b) ; (3) Units of measure similarity (how similar are the units of measure?) including, for example, Simple dictionary-based 4-level score UNI S (a, b) ; and (4) Prevalence similarity (are the two terms similarly prevalent in their respective sources?) including, for example, Absolute difference of the proportional log count abs(plc(a) -- plc(b)), and Proportional log count pic (b) .

[0237] Rank as a feature

[0238] In addition to using EMD and L2JW "distances" as a similarity measure, EMD and L2JW ranks can be used. An absolute EMD distance of 15 between term a and b may appear to be a "poor" match, since the perfect EMD score is 0. But among all EMD scores where term a is held constant and computed against every term from source B, a score of 15 may actually be one of the best EMD scores (when scoped by that term a). The same goes for L2JW scores and ranks. The rank as a fraction between 0 and 1 can be normalized so that sources with different cardinality can be comparable.

[0239] Aside: Prevalence similarity

[0240] Prevalence of a term is a measure of how often observations of that term occur, relative to the source from which the term came. For example, Heart Rate observations account for 2% of all observations in an EHR.

[0241] Prevalence similarity attempts to compare term a's prevalence in source A with term b's prevalence in source B. The hypothesis is that, for example, Heart Rate will have nearly the same prevalence, regardless of the source. What is a good way to quantify prevalence of a term in a way that is agnostic to the source from which it came? One solution includes the proportion of occurrences:

# of occurrences of term

proportional count

total # of observations

[0242] However, the density plot of term occurrences is extremely small in magnitude and long tailed. This means that (1) even the most prevalent terms represent only a tiny fraction of the entire data source, (2) term occurrences vary wildly from term to term, and (3) most proportions are extremely small. To mitigate these factors, the "proportional log of the count" can be used: log ( # o f occurrences o f term a )

log ( # of occurrences o f mo st prevalent term in source A )

[0243] Ifpic (a) is similar to pic (b) , then they are more likely to be a Match, so abs(plc(a) -- plc(b)) can be used as a feature of prevalence similarity. One drawback to this method is that the denominator ofp1c ( a) is not always known. Term a can be taken out of its original source A and included in a Dataset of interesting terms for a study. Furthermore, term a can be the result of merging terms from multiple sources, and since the count of occurrences can be stored, it is difficult to calculate pic ( a) that is comparable to other pic's.

[0244] Interestingly, pic (b) has proven to be a useful feature by itself. Studies often include terms with high prevalence. Take Heart Rate, for example: clinicians measure it often because it helps decision making; studies include it, also because it helps decision making, but in addition analysis may require a high volume of observations. Consequently, a candidate term b with high prevalence is more likely to be included in a Match pair of terms because Matches identify terms that are often included in studies. Using pic (b) can relieve difficulties associated with choosing the denominator for pic (a) .

[0245] Feature Data

[0246] To generate feature data, a new API endpoint can be created. The DatasetID (source A) and the SourcelD (source B) can be specified.

GET /scores/: dataset_id/source_id

[0247] For example, http://localhost: 6789/api /scores /51a52272291b6dc07a0030fa/ism_ne w_chart_events. The response is a csv file with MIME type text/csv.

[0248] The columns of this file can be as set forth in Table 3 as follows: [0249] Table 3

l¾ units

UNrrS(t1 ,t¾ 'mRt ft or Yrerr optional: String Wfc lng

whether the user created a

1 Best: ί Best: 4 Match for this par:, Must give

Worst: 0 Worst: 0 query parameter r ;?

i clude :¾at e¾-tr i; !fl API request

[0250] Training the Classifier

[0251] A classification model can be trained using the feature data to label term pairs as either Match or Non-Match, and a probability of being a Match can be assigned.

[0252] The script scripts/classification/trainModel . R is provided for this task. It takes one argument: datafile - (string, required) File path or URL of feature data. The feature data API endpoint can be hit directly, or the feature data can be saved using cURL/wget.

[0253] The script saves a model in the RData binary format as model . RData. A different classification model can be used, this script can be updated and the new model can be saved in the same RData format.

[0254] The terms log (EMD) , rank (EMD) , L2JW, rank(L2JW), and PLC (b) can be used as features in a Naive Bayes classifier using a gaussian kernel density estimate for the conditional probabilities. By experimentation, this model gave the best sensitivity when compared to SVM, logistic regression, and random forests. It is also relatively computationally inexpensive.

[0255] Prediction

[0256] The script scripts/classification/predict . rb can be the driver for generating prediction scores. It takes one argument: url - (url, required) URL of feature data.

[0257] First, it can obtain feature data and save it to a file. Second, the feature data file can be sent through RData model to generate predictions. Finally, the predictions can be saved as Scores via the Data Ninja API.

[0258] The predictions generated can be the posterior probabilities of being a Match. The algorithm name is PRED, and the params field has some useful information about the model and features used.

[0259] Discovery of Matches

[0260] This is the default mode of the application, intended for use by the Collaborator. The user will be attempting to match his source Terms to the target Elements of a Dataset. Before starting, the user has the option to upload a Seed file, provided by the PI through some file sharing means (email, USB drive, etc.). Uploading a Seed will import a Dataset, its Elements and their AggregateTerms.

[0261] The Collaborator must first select the Dataset which he will contribute Matches to. The Dataset's collection of Elements (and AggregateTerms) is loaded into the app, and the user can now browse through the Elements. He can type into a search field to find an Element by name. When an Element is selected, its histogram, summary statistics, and other AggregateTerm information are displayed in the main content area of the app.

[0262] Next, the Collaborator selects a source, loading the collection of Terms originating from that source. The user has similar capabilities to browse and search for Terms by name. Selecting a Term will also display its histogram, summary statistics, and other information in the main content area. The histograms of the selected Term and AggregateTerm are overlayed on a single graph in different colors, allowing easy comparison of the distribution shapes. The summary statistics are displayed side by side for quick comparison as well. Color is used to differentiate Term information from AggregateTerm information.

[0263] The main content area allows quick, visual comparison between a Term and AggregateTerm, but the Collaborator must still find the most similar pair. The app provides the option to sort the list of Terms by their Scores with the selected AggregateTerm. This allows the user to instantly find the Term with the best similarity Score. If several score algorithms have been computed, each algorithm is presented as a sort option. The user can quickly flip through the sort options if he is unsatisfied with the ranking produced by a particular scoring algorithm.

[0264] When the Collaborator has found a suitable pair of similar Term and AggregateTerm, he clicks the "Match" button to indicate that he believes these to measure the same entity. Matches are stored via the Data API. At any time, the user may view the list of all Matches created for the selected Dataset. Clicking on an item in the list automatically selects the Term and AggregateTerm for viewing in the main content area. If the user has made a Match by mistake, he can delete it from the Match list.

[0265] Finally, when the Collaborator has discovered Matches for as many Elements as possible, he is ready to submit a Review to the PI. He can click the "Download Review" button to download a Review file which can be shared with the PI. This Review contains the Matches he created and the Terms that belong with them.

[0266] Review of Matches

[0267] The PI uses this mode of the application to review and approve Matches, and finally merge Terms into the Elements of his Dataset. First, the PI must upload a Review file to load the Matches and Terms he wishes to evaluate. Next, he selects the Dataset which he will merge these Terms to. Matches belonging to this Dataset are loaded and shown in the Match list. Elements of the Dataset and Terms in the Matches are also loaded as they were in discovery mode.

[0268] The Match list resembles an email inbox, indicating which Matches have not yet been viewed. Clicking a row displays the Term and AggregateTerm in the main content area. If the PI agrees with the Match, he may click the "Approve" button for that Match. The approval status is saved via the Data API so that review progress may be resumed at any time.

[0269] When all Matches have been approved, the "Merge" button becomes enabled. Clicking the Merge button displays a warning and confirmation dialog for the irreversible merge operation. If confirmed, the following operations occur through the Data API: 1) new Elements are created, containing the newly matched Terms; 2) a new Dataset is saved, containing the newly created Elements; and 3) all Matches are deleted. The new Dataset, new Elements and new AggregateTerms constitute a new version of the Seed. Versioning occurs for several reasons, among which are included: to allow publication and consumption of a known "good" state of a Dataset; to revert unwanted changes; and to reproduce historical work derived from a past version of a Dataset.

[0270] EXAMPLES

[0271] The invention will be further explained by the following Examples, which are intended to be purely exemplary of the invention, and should not be considered as limiting the invention in any way.

[0272] Example 1 : Multi-Institutional Retrospective Clinical Research

[0273] As the primary example used throughout this document, a researcher must use historical clinical observations to draw conclusions from the data. Because he is studying a rare disease, the data from one hospital is not comprehensive enough to prove or disprove hypotheses with acceptable confidence. He must collect data from other hospitals, but each hospital uses a different nomenclature when recording their clinical observations. This invention allows the researcher to declare the data elements he is interested in collecting, map his hospital's data to these elements, and accept other hospitals' data mappings to these same elements. This invention facilitates the collaborative mapping process so that the researcher can collect data from multiple institutions.

[0274] Example 2: Automated Data Collection

[0275] Highly skilled, well-trained medical experts are often needed to read data from one clinical database and enter it into another database. This data entry is done to obtain performance reports that can be used to (a) compare relative performance against other hospitals and (b) to identify areas of weakness or inefficiency for improvement. The reports require certain data elements in order to calculate their metrics, but each hospital's data uses a different terminology. This invention allows each participating hospital to map their terminology to the reporting terminology. With this mapping, the medical experts no longer need to enter data, but can focus their energies on ensuring the accuracy and quality of the data and the reports.

[0276] Example 3: Near Real-Time Decision Support Algorithms

[0277] Based on the real-time monitoring of vitals, electrolytes, blood gases, blood cell counts, and ventilator settings, bedside monitors may be able to (a) alert when critical intervention is necessary, (b) suggest effective interventions, (c) indicate the risk of mortality, (d) automatically adjust medication dosages or ventilator settings. These algorithms must be learned from an enormous volume of observation data, intervention data and outcomes data. This invention facilitates the inclusion of new datasets with different terminologies, allowing more data to contribute to the effectiveness of the decision support algorithms.

[0278] Example 4: Data Standards for Various Applications

[0279] With a standard set of data elements, an application ecosystem may be developed to use the data. Examples include a patient dashboard and rounding reports. The application developer need not worry about different terminologies because the data from every hospital is mapped to standard data elements. This invention makes the rapid creation of these mappings possible.

[0280] FIG. 15 depicts a computer device or system 1500 comprising one or more processors 1530 and a memory 1540 storing one or more programs 1550 for execution by the one or more processors 1530.

[0281] In some embodiments, the device or computer system 1500 can further comprise a non-transitory computer-readable storage medium 1560 storing the one or more programs 1550 for execution by the one or more processors 1530 of the device or computer system 1500.

[0282] In some embodiments, the device or computer system 1500 can further comprise one or more input devices 1510, which can be configured to send or receive information to or from any one from the group consisting of: an external device (not shown), the one or more processors 1530, the memory 1540, the non-transitory computer-readable storage medium 1560, and one or more output devices 1570. The one or more input devices 1510 can be configured to wirelessly send or receive information to or from the external device via a means for wireless communication, such as an antenna 1520, a transceiver (not shown) or the like.

[0283] In some embodiments, the device or computer system 1500 can further comprise one or more output devices 1570, which can be configured to send or receive information to or from any one from the group consisting of: an external device (not shown), the one or more input devices 1510, the one or more processors 1530, the memory 1540, and the non-transitory computer-readable storage medium 1560. The one or more output devices 1570 can be configured to wirelessly send or receive information to or from the external device via a means for wireless communication, such as an antenna 1580, a transceiver (not shown) or the like. [0284] Each of the above identified modules or programs corresponds to a set of instructions for performing a function described above. These modules and programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory may store a subset of the modules and data structures identified above. Furthermore, memory may store additional modules and data structures not described above.

[0285] The illustrated aspects of the disclosure may also be practiced in distributed computing environments where certain tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote memory storage devices.

[0286] Moreover, it is to be appreciated that various components described herein can include electrical circuit(s) that can include components and circuitry elements of suitable value in order to implement the embodiments of the subject innovation(s). Furthermore, it can be appreciated that many of the various components can be implemented on one or more integrated circuit (IC) chips. For example, in one embodiment, a set of components can be implemented in a single IC chip. In other embodiments, one or more of respective components are fabricated or implemented on separate IC chips.

[0287] What has been described above includes examples of the embodiments of the present invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but it is to be appreciated that many further combinations and permutations of the subject innovation are possible. Accordingly, the claimed subject matter is intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims. Moreover, the above description of illustrated embodiments of the subject disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosed embodiments to the precise forms disclosed. While specific embodiments and examples are described herein for illustrative purposes, various modifications are possible that are considered within the scope of such embodiments and examples, as those skilled in the relevant art can recognize.

[0288] In particular and in regard to the various functions performed by the above described components, devices, circuits, systems and the like, the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., a functional equivalent), even though not structurally equivalent to the disclosed structure, which performs the function in the herein illustrated exemplary aspects of the claimed subject matter. In this regard, it will also be recognized that the innovation includes a system as well as a computer-readable storage medium having computer-executable instructions for performing the acts and/or events of the various methods of the claimed subject matter.

[0289] The aforementioned systems/circuits/modules have been described with respect to interaction between several components/blocks. It can be appreciated that such systems/circuits and components/blocks can include those components or specified subcomponents, some of the specified components or sub-components, and/or additional components, and according to various permutations and combinations of the foregoing. Subcomponents can also be implemented as components communicatively coupled to other components rather than included within parent components (hierarchical). Additionally, it should be noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers, such as a management layer, may be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein may also interact with one or more other components not specifically described herein but known by those of skill in the art.

[0290] In addition, while a particular feature of the subject innovation may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes," "including," "has," "contains," variants thereof, and other similar words are used in either the detailed description or the claims, these terms are intended to be inclusive in a manner similar to the term "comprising" as an open transition word without precluding any additional or other elements.

[0291] As used in this application, the terms "component," "module," "system," or the like are generally intended to refer to a computer-related entity, either hardware (e.g., a circuit), a combination of hardware and software, software, or an entity related to an operational machine with one or more specific functionalities. For example, a component may be, but is not limited to being, a process running on a processor (e.g., digital signal processor), a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. Further, a "device" can come in the form of specially designed hardware; generalized hardware made specialized by the execution of software thereon that enables the hardware to perform specific function; software stored on a computer-readable medium; or a combination thereof.

[0292] Moreover, the words "example" or "exemplary" are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words "example" or "exemplary" is intended to present concepts in a concrete fashion. As used in this application, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise, or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing instances. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.

[0293] Computing devices typically include a variety of media, which can include computer-readable storage media and/or communications media, in which these two terms are used herein differently from one another as follows. Computer-readable storage media can be any available storage media that can be accessed by the computer, is typically of a non-transitory nature, and can include both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable storage media can be implemented in connection with any method or technology for storage of information such as computer-readable instructions, program modules, structured data, or unstructured data. Computer-readable storage media can include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other tangible and/or non-transitory media which can be used to store desired information. Computer-readable storage media can be accessed by one or more local or remote computing devices, e.g., via access requests, queries or other data retrieval protocols, for a variety of operations with respect to the information stored by the medium. [0294] On the other hand, communications media typically embody computer- readable instructions, data structures, program modules or other structured or unstructured data in a data signal that can be transitory such as a modulated data signal, e.g., a carrier wave or other transport mechanism, and includes any information delivery or transport media. The term "modulated data signal" or signals refers to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in one or more signals. By way of example, and not limitation, communication media include wired media, such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.

[0295] In view of the exemplary systems described above, methodologies that may be implemented in accordance with the described subject matter will be better appreciated with reference to the flowcharts of the various figures. For simplicity of explanation, the methodologies are depicted and described as a series of acts. However, acts in accordance with this disclosure can occur in various orders and/or concurrently, and with other acts not presented and described herein. Furthermore, not all illustrated acts may be required to implement the methodologies in accordance with the disclosed subject matter. In addition, those skilled in the art will understand and appreciate that the methodologies could alternatively be represented as a series of interrelated states via a state diagram or events. Additionally, it should be appreciated that the methodologies disclosed in this specification are capable of being stored on an article of manufacture to facilitate transporting and transferring such methodologies to computing devices. The term article of manufacture, as used herein, is intended to encompass a computer program accessible from any computer-readable device or storage media.

[0296] Although some of various drawings illustrate a number of logical stages in a particular order, stages which are not order dependent can be reordered and other stages can be combined or broken out. Alternative orderings and groupings, whether described above or not, can be appropriate or obvious to those of ordinary skill in the art of computer science. Moreover, it should be recognized that the stages could be implemented in hardware, firmware, software or any combination thereof.

[0297] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to be limiting to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the aspects and its practical applications, to thereby enable others skilled in the art to best utilize the aspects and various embodiments with various modifications as are suited to the particular use contemplated.

Claims

We claim:

1. A computer-implemented system for identifying matching data of a real-world, measurable concept, where the data have inconsistent associated descriptive labels, said system comprising:

computer software executed on appropriate computer hardware, wherein the software executes the following method steps:

establish elements of a dataset;

compare observations of a real-world occurrence of a measurable concept to appropriate elements of the dataset;

output data representing the compared observation and elements of the dataset;

wherein the output indicates whether the compared observation and elements of the dataset represent the same real-world, measurable concept.

2. The system of claim 1, wherein the step of establishing elements of a dataset is performed based on a single set of data.

3. The system of claim 1, wherein the step of establishing elements of a dataset is performed based on multiple inputs of data from multiple sources.

4. The system of any one of claims 1-3, wherein the output data is a graphical representation indicating whether the observations were consistent with the selected elements of the dataset.

5. A computer-implemented system for matching data of a real-world measurable concept, where the data have inconsistent associated identifying or descriptive labels, said system comprising:

a) a Term Generation Framework, in which

observational data, stored in any of a wide variety of formats, is used to compute the aggregate facts (Terms) about observations of a single entity, wherein aggregate facts include but are not limited to histograms, summary statistics, descriptive labels, and units of measure; a single Term derives from a single data source, but different Terms may come from multiple data sources, differentiated by geography, physical location, software measures, hardware, data formats, policies, and security;

b) a module for creation of dataset and elements, which establishes a single purpose for a Dataset and establishes the required real-world data Elements that serve that purpose;

c) an Output module for mapping/correlation, indicating equivalence between Terms and Elements of a Dataset;

d) a Score Generation Framework, in which

the construction of the output is aided by the computation of the relative similarity of aggregate facts about observations of a single entity;

similarity is expressed according to, but not limited to, the following features: distribution, descriptive label, and units of measure;

distribution similarity is computed using statistical or information theoretic measures, including Kullback-Leibler Divergence estimates and Earth Mover's Distance;

label similarity is computed using string distance metrics, including variations on

Jaro-Winkler, and Level2Jaro-Winkler;

label similarity incorporates semantic meaning from established semantic databases, such as the UMLS Metathesaurus and its constituent ontologies and terminologies;

units similarity is determined from standard units in practice, equivalent units, common abbreviations, and established unit standards such as UCUM; and

composite similarity across multiple features is computed as well, including but not limited to a linear combination (weighted sum) of individual features or the na^'ive bayes classifier;

e) a Data API, which is an application programming interface that provides storage, retrieval, and manipulation of data necessary for the collaborative term mapping process, including but not limited to Terms, Scores, Elements, Datasets, Matches, and AggregateTerms; and

f) a Web Application, which is a computer software user interface that

i) facilitates the assignment of Terms to Elements of a Dataset and utilizes the Data API to assign a Term to an Element, and provides a way to browse, search, sort, and filter candidate Terms based on criteria including not limited to descriptive label, units of measure, similarity scores, and summary statistics,

evaluates the relevancy of a Term to an Element, wherein supporting data is displayed so it can be visualized by a user, wherein the supporting data includes but is not limited to visual histogram charts, similarity scores, descriptive labels, summary statistics, units of measure, tabular numeric data, and time series data,

suggests Terms to be assigned to Elements by creating a Match, wherein matches are created by one party may be shared with another party for review in the same fashion as evaluation of relevancy with supporting data, wherein matches may be approved and subsequently merged, and wherein the result of merging Matches is a new version of an Element containing data from the merged Term and a new version of a Dataset containing the new Elements.

6. A computer implemented method for analyzing databases, comprising:

on a device having one or more processors and a memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for:

1. obtaining a first aggregate attribute from first data stored in a first database, the first database comprising a first element;

2. obtaining a second aggregate attribute from second data stored in a second database, the second database comprising a second element;

3. comparing the first aggregate attribute of the first element with the second aggregate attribute of the second element;

4. obtaining a set of features as a result of the comparing step, the set of features comprising a quantitative measure of a similarity of the first aggregate attribute of the first element and the second aggregate attribute of the second element; and

5. determining whether the first element and second element are equivalent to each other; and

6. outputting a map that relates the first element from the first database with the second element from the second database.

7. The computer implemented method of claim 6, wherein the first aggregate attribute comprises one or more from the group consisting of a first statistical measurement derived from the first aggregate attribute, a first label derived from the first aggregate attribute and a first unit of measurement associated and the first aggregate attribute, and

wherein the second aggregate attribute comprises one or more from the group consisting of a second statistical measurement derived from the second aggregate attribute, a second label derived from the second aggregate attribute and a second unit of measurement associated and the second aggregate attribute.

8. The computer implemented method of claim 7, wherein the first statistical measurement is a first histogram comprising an x-axis of observed values for the first aggregate attribute, wherein the second statistical measurement is a second histogram comprising an x-axis of observed values for the second aggregate attribute, and

wherein the step of comparing comprises a statistical comparison of the first and second histograms.

9. The computer implemented method of claim 8, wherein the statistical comparison of the first and second histograms comprises one from the group consisting of a probability- distance measure, Kullback-Leibler divergence, Earth Mover's Distance, Kolmogorov-Smirnoff similarity and Anderson-Darling similarity.

10. The computer implemented method of any one of claims 7-9, wherein the first label and the second label are compared in the step of comparing by using one from the group consisting of a string-edit technique, a token-based distance technique, a semantic comparison technique, a lexical similarity technique, a language system, SNOMED-CT, LOINC, UMLS, a thesaurus, synonyms of the first and second labels, a string-matching technique, a Level2Jaro- Winkler hybrid distance technique and prevalence similarity.

11. The computer implemented method of any one of claims 7-10, wherein the first unit of measurement and the second unit of measurement are compared in the step of comparing by using one from the group consisting of a widely accepted standard unit, a list of different spellings relating to the standard unit, a list of commonly used alternative units, a list of other units measuring the same quantity and a substitution of the standard unit for a missing unit.

12. The computer implemented method of any one of claims 6-11, the one or more programs further including instructions for:

displaying a graphical user interface comprising one or more fields for displaying information corresponding with one or more of the first database, the first element, the second database, the second element, the first aggregate attribute, the second aggregate attribute and the comparing of the first aggregate attribute and the second aggregate attribute.

13. The computer implemented method of claim 12, wherein the graphical user interface comprises one or more fields for displaying information corresponding with one or more of the first statistical measurement, the first label, the first unit of measurement, the second statistical measurement, the second label and the second unit of measurement.

14. The computer implemented method of claim 13, wherein the graphical user interface comprises one or more fields for displaying information corresponding with one or more of the first histogram, the second histogram and the statistical comparison.

15. The computer implemented method of any one of claims 6-14, wherein the step of outputting the map comprises displaying a list of first elements or a list of second elements sorted by a statistical score while concurrently displaying one or more histograms based on the first and second aggregate attributes.

16. The computer implemented method of any one of claims 12-15, wherein the graphical user interface is adapted to allow a user to select from a plurality of first databases, a plurality of first elements, a plurality of second databases, a plurality of second elements and a plurality of systems for comparing the first aggregate attribute and the second aggregate attribute.

17. The computer implemented method of any one of claims 12-16, wherein the graphical user interface further comprises a scatterplot based on the first aggregate attribute and the second aggregate attribute.

18. The computer implemented method of any one of claims 6-17, the one or more programs further including instructions for:

calculating a match prediction score.

19. The computer implemented method of any one of claims 6-18, the one or more programs further including instructions for:

repeating steps 1-5 for every combination of elements from each database.

20. A computer system for analyzing databases, comprising:

one or more processors; and

memory to store:

one or more programs, the one or more programs comprising instructions for:

21. A non-transitory computer-readable storage medium storing one or more programs for analyzing databases, the one or more programs for execution by one or more processors of a computer system, the one or more programs comprising instructions for: