WO2025210554A1 - Data set discovery in data exchanges - Google Patents
Data set discovery in data exchanges
- Publication number
- WO2025210554A1 (PCT/IB2025/053497)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- data sets
- user
- data set
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
Definitions
- the present disclosure relates generally to methods, processes and systems for sharing data sets and, more specifically, to data set discovery in data exchanges.
- a data marketplace is a platform where data providers can list and share their data sets, and data consumers can browse and download (in some cases upon purchase of a license or subject to an open-source license) these data sets for various purposes, such as analysis, machine learning, or business intelligence.
- the diversity and number of data sets in such platforms can be expansive, making it difficult for users to find, evaluate, and trust data sets that are available.
- Such systems are expected to become more important as data quality improves with better sensors, collection, and reporting, and as data use cases expand through machine learning, better statistical analysis tools, and more disciplined data-driven decision-making practices.
- Some aspects include a computer-implemented method or process, including: obtaining, with a computer system hosting a data exchange, a plurality of data sets from a plurality of different users of the data exchange; receiving, with the computer system, a query to search among the plurality of data sets, the query specifying a token; determining, with the computer system, a group of data sets among the plurality of data sets that is responsive to the query, wherein determining the group of data sets comprises checking titles, tables, field names, and cell values of data sets among the plurality of data sets for use of the token; determining, with the computer system, a ranking of members of the group of data sets responsive to the query; and sending, with the computer system, in response to the query, summaries of at least some members of the group of data sets with instructions to present the summaries according to the ranking.
- Some aspects include a system, including: one or more processors; and memory storing instructions that when executed by the processors cause the processors to effectuate operations of the above-mentioned process.
- Figure 1 illustrates a logical and physical architecture of a computing environment in which a data exchange may be implemented and used in accordance with some embodiments of the present techniques.
- Figure 2 is an example of a process that may be executed by the system of figure 1 to facilitate data discovery in accordance with some embodiments of the present techniques.
- Figure 3 illustrates an example of a process that may be executed by the system of figure 1 to provide a social graph overlay on information in a data exchange in accordance with some embodiments of the present techniques.
- Figure 4 is an example user interface showing query completion in accordance with some embodiments of the present techniques.
- Figure 5 is a user interface showing an example of a default home page in accordance with some embodiments of the present techniques.
- Figure 7 is an example of a dataset detail history page in accordance with some embodiments of the present techniques.
- Figure 8 is an example of a table detail landing page in accordance with some embodiments of the present techniques.
- Figure 9 is an example of a search results page in accordance with some embodiments of the present techniques.
- Some embodiments include a software as a service (SaaS) data marketplace, or other form of data exchange, with improved discovery of data sets for users.
- Some embodiments support searching for data sets by table across data sets with ranking of search results that accounts for absent data often plaguing lower-quality data sets.
- Some embodiments also account for downstream applications pulling data from these tables when forming the rankings, up-ranking those data sets and tables with greater usage.
- search results are presented in a visually-compelling, area-based visualization like a tree map that supports zooming and panning to quickly and smoothly explore search results, e.g., in a web-browser or native application.
- Embodiments in some cases also surface frequency of query terms in the data sets to facilitate exploration of search results and implement recommendation systems to personalize recommendation of data sets for users. These aspects are described below with reference to Figure 2, after aspects are described of an example system in which they may be implemented with reference to Figure 1.
- the computing environment may include a server system 300, a third-party server 309, a client computing device 307, and the internet 305.
- the server system 300 may include a controller 302, a web server 304, an API (application program interface) server 306, and a database 308, along with the following: user authentication and authorization module 310; user profile management module 312; data upload and storage module 314; data catalog and discovery module 316; data quality and metadata management module 318; pricing and billing module 320; analysis tools integration module 322; data security and encryption module 324; audit and compliance module 326; feedback and ratings module 328; notification system module 330; APIs and integration endpoints module 332; administration and moderation tools module 334; and infrastructure management module 336.
- the controller 302 may coordinate the operation of the other listed components, e.g., by assigning tasks, sending data, and routing results, instructions, and data for user interfaces through the web server 304 or API server 306, which, in some cases, may be non-blocking servers configured to support concurrent sessions with a plurality of remote computing devices.
- the system 300 may use a software framework or platform, such as Flask™, Django™, Express™, or Ruby on Rails™; any suitable software development framework may be employed to achieve analogous functionalities.
- the system 300, in certain implementations, may utilize a database interaction module or solution, such as SQLAlchemy™; MySQL™, PostgreSQL™, MongoDB™, Oracle™, or other database management systems might be employed to store, retrieve, and manage user profile data in database 308.
- Some embodiments may include a user authentication and authorization module 310.
- this module 310 ensures (e.g., verifies or increases the likelihood) that users are who they claim to be. This may involve username and password combinations, OAuth integration with third-party services, or multi-factor authentication, e.g., with passkeys, WebAuthn, FIDO2, or other suitable protocols.
- module 310 may also determine what actions a user is permitted to perform, like uploading data, purchasing access, etc.
- a user model, in some embodiments, may be defined with various attributes. One of the attributes may include an is_admin attribute or an equivalent flag to distinguish regular users from administrative users.
- the system 300 supports multiple tenants (e.g., businesses or organizations), and tenant-specific accounts may support roles and permissions specific to their account, e.g., limiting certain forms of access (for instance, read or write) to only their employees or only those employees having a certain role.
- the module 310 may accept credentials, which may include a username and password entered and submitted via a log-in interface on device 307 or submitted with an API request from system 309.
- credentials may be checked against a database, like database 308.
- the password, rather than being stored in plain text, may be hashed.
- Other mechanisms such as token-based authentication, biometric verification, or two-factor authentication, may be implemented as alternatives or in addition to password checking.
- the module 310 may employ authorization mechanisms to distinguish the functionalities available to regular users versus administrative users based on the aforementioned is_admin attribute or other determining factors. While a dashboard is one possible interface, other embodiments may use landing pages, portals, or other user interfaces tailored to the authenticated user's permissions. For user logout, the module 310 in some embodiments may invalidate the user's session, although other mechanisms, such as token expiration or user-driven session termination, might also be employed.
- Some embodiments may include user profile management module 312.
- This module 312 may allow users to create, update, and manage their profile information. Some embodiments may also allow users to set preferences, payment details, and other aspects of their profile, along with their tenant employer and role therein.
- the user profile data may encompass various attributes. A representative attribute set might be a username, an email, and user preferences, among others. However, other attributes may be included, such as profile images, user histories, associated devices, or any other relevant user-specific data.
- the user preferences in one embodiment might be stored as a string formatted in JSON (JavaScript™ object notation). In other embodiments, the preferences could be stored in structured tables, XML (extensible markup language) strings, or other data formats suitable for capturing user preferences and settings.
- the user might be able to select and submit their data through an upload interface exposed by module 314.
- the module 314 may validate the provided data, ensuring appropriate format and content.
- the data may be stored in a predefined or dynamically determined location.
- the data could be stored locally on server system 300.
- the system 300 might use cloud storage solutions, distributed databases, content delivery networks (CDNs), local cache on device 307, or other scalable storage options.
- the module 314 might also maintain a record, such as a catalog, of the uploaded data sets.
- Each data set may be associated with metadata, potentially including a filename, date of upload, source, description, characteristic statistics (like measures of central tendency and measures of variation) computed by module 314 upon ingest, or any other relevant data attributes.
- the storage of such metadata can provide a means for future retrieval, organization, or analysis of the uploaded data sets.
- Module 314 may perform the below-described indexing operations to facilitate search operations described with reference to Figure 2 upon upload or as a batch process performed periodically.
- Module 314 may also include the following: validation mechanisms to handle various data formats and ensure data integrity; integration with user authentication systems to assign data ownership and access permissions; mechanisms to handle larger data sets, such as chunking, compression, or streaming uploads; notifications or alerts to inform users about the status of their data upload; backup, redundancy, and recovery functionalities to ensure data durability and availability.
- Some embodiments may include data catalog and discovery module 316.
- module 316 may provide search functionality for users to discover data sets of interest.
- Some embodiments may also include a catalog system for cataloging and tagging data sets. Some embodiments may generate data previews that relatively quickly provide samples of data before the full data set is accessed.
- the module 316 may perform the search operations described below with reference to Figure 2.
- a collection of data may be implemented to represent and store details about available data sets.
- This object may contain attributes including, but not limited to (which is not to suggest other lists are limiting), a unique identifier, a name, a description, an associated filename, and associated tags.
- the object may include additional attributes such as upload dates, data sizes, user ownership, or other relevant metadata.
- the module 316 may provide an interface, such as instructions like HTML, JavaScript TM, and the like by which a web interface is created and changed on device 307.
- This interface might allow users to view a list of all available data sets.
- search functionality may be incorporated, allowing users to query based on data set tags or other attributes.
- this search capability might use string matching, but in other embodiments, other search mechanisms, like full-text search, machine learning-based recommendations, embedding vector search, or filtered searches might be employed, examples of which are described below in greater detail with reference to Figure 2.
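- For illustration, the following sketch shows one naive way such token matching and ranking might be implemented; the data structures and the match-count ranking are hypothetical simplifications, not the claimed implementation:

```python
# Hypothetical structures standing in for the exchange's real schema; the
# claimed method checks titles, tables, field names, and cell values for a
# query token, here via naive case-insensitive substring matching.
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    fields: list[str]
    rows: list[list[str]]


@dataclass
class DataSet:
    title: str
    tables: list[Table] = field(default_factory=list)


def match_score(ds: DataSet, token: str) -> int:
    """Count appearances of the token across a data set's title, table
    names, field names, and cell values."""
    t = token.lower()
    score = int(t in ds.title.lower())
    for table in ds.tables:
        score += int(t in table.name.lower())
        score += sum(t in f.lower() for f in table.fields)
        score += sum(t in str(v).lower() for row in table.rows for v in row)
    return score


def search(data_sets: list[DataSet], token: str) -> list[DataSet]:
    """Return matching data sets ranked by descending match count."""
    scored = [(match_score(ds, token), ds) for ds in data_sets]
    return [ds for s, ds in sorted(scored, key=lambda p: -p[0]) if s > 0]
```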
- Upon selecting a data set from the list or search results, users might access detailed views of the data sets, displaying extensive metadata and possibly offering download or access links, again with greater detail provided below with reference to Figure 2.
- module 316 may include the following: mechanisms to handle pagination or lazy loading for displaying extensive lists of data sets; enhanced user interfaces, possibly interactive, using technologies like AJAX, WebSockets, or single page application (SPA) frameworks; integration with user authentication systems to personalize data set listings or manage access permissions; analytics to track and present data set popularity, access frequencies, or user interactions; and integration with third-party platforms or services to extend data set sourcing, storage, or accessibility.
- a function or other method may generate metadata for the data sets. The generation can be based on the analysis of the raw data. For example, for data sets in CSV (comma separated value) format, the metadata might include row count, column count, and column names.
- the metadata may comprise statistics such as mean, median, mode, standard deviation, or any other descriptive statistic related to the data.
- Other examples include column (or field) specific visualizations, like histograms, spark lines, box plots, or the like.
- data sets, tables, or sets of values of fields may be submitted to the ChatGPT™ code interpreter plugin with a prompt requesting some form of data analysis, like a request to generate a heat map of the world or North America when geographic data is detected.
- the resulting visualizations or other forms of analysis may be included in the metadata and displayed in a similar manner as the user navigates from the search results to the data set itself, e.g., upon the user selecting the data set in a tree map, upon the user selecting a table therein, or upon the user selecting a field therein (or pairs of fields or larger collections to view relationships).
- metadata may be precomputed before receiving the query or request to navigate to a view of a user interface that displays the metadata.
- Some embodiments may populate a “my recommended data” interface or similar default interface to facilitate discovery of data sets with a recommendation engine of module 316. Examples may surface data sets with which the user has previously interacted and which have subsequently been revised or commented upon, up-weighting based on freshness of such edits in ranking.
- One such method may employ collaborative filtering techniques. These techniques might operate by examining the behavior or preferences of multiple users within the data marketplace or other form of data exchange. In certain implementations, the method may focus on identifying users that are similar to a target user based on their past interactions with data sets, such as data sets they have purchased, rated, or viewed. Once a set of similar users is identified, the system can aggregate data sets that those users have shown a preference for and subsequently recommend a subset of those aggregated data sets to the target user.
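- As an illustration of such collaborative filtering, the following hedged sketch scores data sets for a target user from a user-by-data-set interaction matrix using cosine similarity between users; the matrix layout, function names, and the value of k are assumptions:

```python
# Illustrative user-based collaborative filtering: rows of `interactions`
# are users, columns are data sets, entries are implicit feedback (e.g.,
# 1.0 if purchased/rated/viewed).
import numpy as np


def recommend(interactions: np.ndarray, target: int, k: int = 5) -> list[int]:
    # Cosine similarity between the target user and every other user.
    norms = np.linalg.norm(interactions, axis=1) + 1e-9
    sims = interactions @ interactions[target] / (norms * norms[target])
    sims[target] = -1.0  # exclude the target user from their own neighbors
    neighbors = np.argsort(sims)[::-1][:k]
    # Aggregate the neighbors' interactions, masking items already seen.
    scores = interactions[neighbors].sum(axis=0)
    scores[interactions[target] > 0] = -np.inf
    return list(np.argsort(scores)[::-1][:k])
```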
- Content-based filtering techniques may focus on the features or metadata of data sets and a profile of the user's preferences.
- data sets that the user has shown interest in may have their features or metadata weighted more heavily in the user's profile, while those that the user has shown disinterest in may receive a negative weighting.
- the system may then compare the features or metadata of unacquired data sets to this weighted profile to generate a score for each data set. Data sets with the highest scores, indicating a greater alignment with the user's profile, may then be recommended to the user.
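- A minimal content-based sketch along these lines, assuming each data set's features or metadata have already been reduced to a numeric vector (the encoding step is outside this sketch), might look like:

```python
# Illustrative content-based filtering over pre-encoded feature vectors
# (one row per data set); all names are hypothetical.
import numpy as np


def build_profile(features: np.ndarray, liked: list[int],
                  disliked: list[int]) -> np.ndarray:
    """Weight liked data sets positively and disliked ones negatively."""
    profile = features[liked].sum(axis=0) - features[disliked].sum(axis=0)
    return profile / (np.linalg.norm(profile) + 1e-9)


def rank_unacquired(features: np.ndarray, profile: np.ndarray,
                    owned: set[int]) -> list[int]:
    """Score every data set against the profile; skip those already owned."""
    scores = features @ profile
    return [int(i) for i in np.argsort(scores)[::-1] if int(i) not in owned]
```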
- hybrid methods may be used which combine aspects of both collaborative filtering and content-based filtering to suggest data sets within a data marketplace.
- One potential approach may involve separately running collaborative filtering and content-based models and subsequently combining their scores to generate a final recommendation score for each data set.
- the combination of scores may be achieved through various methods including but not limited to a weighted sum, averaging, or a more complex function that takes into account other factors.
- matrix factorization techniques may be utilized in certain embodiments, especially where the user-data set interaction matrix within the marketplace is decomposed into multiple matrices representing latent factors. Predictions derived from the decomposition can help in suggesting data sets to users. Matrix factorization may involve approximating a user-item interaction matrix by decomposing it into multiple matrices, which may represent latent or hidden features. These latent features could provide a condensed representation and might capture patterns that are not immediately apparent in the initial matrix.
- the user-item interaction matrix, in some instances, might be of size m x n, where m represents the total number of users and n represents the total number of items, such as data sets.
- the individual entries in this matrix may represent the interaction or preference of a specific user towards a specific item. However, it is to be noted that many of these entries could be missing or undefined, given that not every user may interact with every item.
- the matrix may be factorized into two distinct matrices.
- One matrix, often referred to as P, might be of size m x k and could represent the association or affinity between users and certain features.
- a second matrix, which might be labeled Q, could be of size n x k and may denote the association between items and the same set of features.
- the product of these matrices (or a suitable transformation of one, such as its transpose) might serve to approximate the original user-item interaction matrix.
- the process of factorizing the original matrix into these two matrices might involve minimizing the difference or error between the product of these matrices and the known entries in the original matrix.
- This error, in some implementations, could be quantified using specific loss functions.
- One common approach could be to use a function that captures the Mean Squared Error (MSE) between known interactions in the original matrix and the corresponding values in the product of the factorized matrices.
- the latent features captured in the matrices may not inherently possess explicit meanings. However, they may encapsulate distinct patterns in user-item interactions. For instance, within a marketplace for books, these latent features might implicitly represent genres, themes, or author styles, even if such labels are not explicitly associated.
- predictions for unknown or missing entries in the original matrix can be made by computing the product of the factorized matrices or an appropriate transformation thereof. These predictions, in turn, could facilitate the generation of personalized data set recommendations for users.
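- A minimal matrix-factorization sketch, assuming missing interactions are stored as NaN and using stochastic gradient descent on the MSE over known entries (the hyperparameters are illustrative, not tuned values), might look like:

```python
# Factor an m x n interaction matrix R (NaN marks unknown entries) into
# P (m x k) and Q (n x k) by SGD on the MSE over known entries only.
import numpy as np


def factorize(R: np.ndarray, k: int = 8, lr: float = 0.01,
              reg: float = 0.02, epochs: int = 100):
    m, n = R.shape
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(m, k))
    Q = rng.normal(scale=0.1, size=(n, k))
    known = [(u, i) for u in range(m) for i in range(n)
             if not np.isnan(R[u, i])]
    for _ in range(epochs):
        for u, i in known:
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q  # predict any missing entry as P[u] @ Q[i]
```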
- Developers might introduce additional features to augment the matrix factorization process. For instance, biases associated with specific users or items could be incorporated into the factorization model to account for inherent tendencies. Additionally, temporal dynamics, capturing how user preferences evolve over time, might be integrated into the model. Furthermore, external information, such as metadata about items or user demographics, could also be fused into the factorization process to provide more context-aware recommendations of data sets.
- Deep learning techniques may be adopted within a data marketplace in some embodiments.
- Neural networks possibly including architectures such as autoencoders, may be trained on user-data set interaction data.
- the latent representations captured during the training may help compute similarity scores between data sets, users, or both, to further enhance data set recommendations.
- Contextual recommendations within a data marketplace may consider variables such as the time of day, user's recent search history, the specific sector or field of the user, or any other contextual data that may be relevant. This might allow for more tailored data set suggestions, such as recommending finance-related data sets during market hours or data sets related to a recent news event if the user has shown interest in that area.
- Developers may also choose to integrate additional features into the recommendation process of the data marketplace. For example, algorithms can incorporate rules to ensure diverse data set recommendations or to avoid recommending data sets the user has recently viewed or acquired. In some scenarios, the recommendation engine might also consider factors like global research trends, emerging fields of study, or external news and events that could influence data set preferences.
- Some embodiments may include data quality and metadata management module 318.
- This module 318, in some embodiments, may generate or allow input of metadata such as descriptive information for data sets.
- Some embodiments of module 318 may provide tools or processes to assess and rate the quality or reliability of data sets as well.
- the system may allow external input of custom metadata.
- the input metadata might contain details like description, source of the data set, creation date, or any other relevant information.
- the externally provided metadata may be integrated with the generated metadata.
- module 318 may itself or through input from users assess the quality of the data sets. The assessment may be based on various criteria. In one example, the quality is determined by the number of rows in the data set. A data set with more than a threshold number, say 1000 rows, may be deemed of high quality. In other embodiments, quality assessment might be based on the completeness of the data, absence of null or missing values, consistency in data patterns, or any other relevant factor. Some embodiments may employ machine learning models or statistical techniques to determine data set quality. In certain embodiments, module 318 may allow for the retrieval of the generated or inputted metadata. This provides users or other systems with descriptive information about the data sets.
- a quality score may be generated and stored in association with the data set. This score might be on a scale, such as 1 to 5. Embodiments may retrieve and present this quality score, giving insights into the perceived value or reliability of the data set. Some embodiments may use this score for ranking data sets to return in response to a query, up-ranking data sets designated as being higher-quality.
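- A hedged sketch of such a 1-to-5 quality score, combining the row-count threshold mentioned above with a completeness measure (the point weighting is an assumption), might look like:

```python
# Hypothetical 1-to-5 quality score combining the >1000-row criterion from
# the description with cell completeness.
def quality_score(rows: list[dict]) -> int:
    if not rows:
        return 1
    cells = [v for row in rows for v in row.values()]
    completeness = sum(v not in (None, "") for v in cells) / len(cells)
    size_points = 2 if len(rows) > 1000 else 1      # larger sets score higher
    completeness_points = round(3 * completeness)   # 0-3 points
    return max(1, min(5, size_points + completeness_points))
```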
- Metadata might also include information like file size, data type distributions, unique values count, or any other relevant metric, and the quality assessment can be based on numerous other factors, including data freshness, source reliability, historical accuracy, or user feedback.
- the quality score might be represented in different formats, such as stars, grades, or descriptive labels like “High Quality” or “Low Quality.”
- Some embodiments include pricing and billing module 320.
- This module 320, in some embodiments, allows data providers to set prices to access their data sets. Some embodiments handle transactions when a user wants to purchase access and manage billing, in some cases with subscription or tiered payment models supported.
- data sets may be represented as instances of a data set class. Each data set may possess an identification number (id), a descriptive name (name), and an associated price (price). Additionally, users of the marketplace may be represented as instances of a user class, which may have an identification number (id), a name (name), and a balance (balance) indicating their available funds.
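- A minimal sketch of these classes and a purchase operation might look like the following; the field names follow the description above, while the purchase logic is illustrative, with persistence, concurrency, and refunds omitted:

```python
# Sketch of the data set and user classes described above, plus a purchase
# operation that debits the buyer's balance.
from dataclasses import dataclass


@dataclass
class DataSet:
    id: int
    name: str
    price: float


@dataclass
class User:
    id: int
    name: str
    balance: float


def purchase(user: User, data_set: DataSet) -> bool:
    """Debit the buyer's balance if funds suffice; return success."""
    if user.balance < data_set.price:
        return False
    user.balance -= data_set.price
    return True
```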
- In module 320, a transaction history may be maintained for each user, allowing them to view their past purchases and top-up actions, and a promotional or discount mechanism may be integrated, allowing data set providers or the marketplace administrator to offer data sets at discounted prices for limited periods.
- Some embodiments of module 320 may implement tiered pricing structures, allowing users to choose from different access levels or data quality tiers at varying prices.
- Refund mechanisms may be provided by module 320, allowing users to request a refund within a specific window after purchase if they find the data set unsatisfactory.
- module 320 may support subscription models, where users pay a regular fee for access to a collection of data sets or for enhanced features within the marketplace. Integration with external payment gateways or financial institutions may be incorporated by module 320 to facilitate transactions, manage user balances, and handle currency conversions if the marketplace operates across multiple countries.
- Module 322 may allow users to perform basic or complex analytics on data sets, in some cases integrating with third-party tools. Some embodiments may include visualization tools for graphical representation of data or transformations thereon. Module 322, in certain embodiments, is equipped with an endpoint to accept data uploads. For instance, this endpoint may be designed to accept files in a CSV format, though in other embodiments, other formats like JSON, Excel™, Parquet, or database dumps may be accommodated. Once data is uploaded, it may be stored temporarily in memory for quick processing, although other storage mechanisms like databases, file storage, or cloud storage solutions can also be employed.
- In module 322, after data has been uploaded, a user may request statistics about the data set.
- the statistics generated might include measures of central tendency like mean, mode, or median; measures of spread like standard deviation or variance; and counts of data points, among others.
- other statistics or data summaries could be computed, such as skewness, kurtosis, or custom-defined metrics.
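- For illustration, a hedged sketch of such an upload endpoint and statistics request using Flask and pandas might look like the following; the route names and in-memory storage are assumptions, and the requested column is assumed numeric:

```python
# Illustrative CSV upload endpoint and per-column statistics endpoint.
import io

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
uploads: dict[str, pd.DataFrame] = {}


@app.post("/upload/<name>")
def upload(name: str):
    # Parse the request body as CSV and keep the frame in memory.
    uploads[name] = pd.read_csv(io.BytesIO(request.get_data()))
    return jsonify({"rows": len(uploads[name])})


@app.get("/stats/<name>/<column>")
def stats(name: str, column: str):
    col = uploads[name][column]
    return jsonify({"mean": col.mean(), "median": col.median(),
                    "std": col.std(), "count": int(col.count())})
```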
- the module 322 provides visualization capabilities.
- a user might request a histogram of a particular data column, but bar charts, line charts, scatter plots, or more complex visualizations may also be provided.
- In some embodiments, matplotlib is employed to generate these visualizations; other visualization libraries or tools such as Seaborn™, Plotly™, D3.js™, or even integrated services like Tableau™ or PowerBI™ could also be used.
- the visualizations, once generated, may be converted into a suitable format for web transmission.
- In some embodiments, a PNG (portable network graphics) image format is employed, though JPEG (Joint Photographic Experts Group), GIF (Graphics Interchange Format), or SVG (Scalable Vector Graphics) formats, or HTML (hypertext markup language) and JavaScript™-based visualizations, can be used.
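- A brief sketch of server-side histogram rendering with matplotlib, returning a base64-encoded PNG suitable for web transmission (the function name is illustrative):

```python
# Render a histogram with matplotlib and return it as a base64 PNG string.
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend for server-side rendering
import matplotlib.pyplot as plt


def histogram_png(values: list[float], bins: int = 20) -> str:
    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # release the figure's memory
    return base64.b64encode(buf.getvalue()).decode("ascii")
```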
- error handling and data validation are incorporated to enhance system robustness. This might involve checking the uploaded data for consistency, missing values, or potential anomalies.
- the module 322 could provide feedback or even suggestions for data cleaning and preprocessing.
- Some embodiments of data security and encryption module 324 may use encryption techniques to safeguard sensitive data, ensuring confidentiality, integrity, and availability of data.
- the module 324 utilizes symmetric encryption for securing data at rest.
- a key, which may be derived from a password or other secret, may be used to both encrypt and decrypt the data.
- the generation of this symmetric key may involve the use of a salt, which may include a random sequence of bytes.
- This salt, when combined with the password, can be input to a Key Derivation Function (KDF) to produce the symmetric key.
- One possible KDF that may be used is PBKDF2HMAC, although other KDFs such as Argon2, scrypt, or bcrypt might also be used depending on system requirements.
- the salt is generated using a cryptographic random number generator.
- the generated salt may be stored separately from the encrypted data, and in some implementations, it might be stored alongside the encrypted data, typically as a prefix to the ciphertext.
- the password used for deriving the symmetric key is not hard-coded within the system but is obtained from secure external sources or inputs, ensuring dynamicity and enhanced security. This password might be provided by an end user, retrieved from secure environment variables, or fetched from secure key management systems.
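- A hedged sketch of this symmetric scheme using the Python cryptography package, with the salt stored as a prefix to the ciphertext as described above (the iteration count is an assumption):

```python
# Derive a key from the password and a random salt via PBKDF2HMAC, encrypt
# the data with Fernet, and prefix the salt to the ciphertext.
import base64
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC


def encrypt_at_rest(password: bytes, plaintext: bytes) -> bytes:
    salt = os.urandom(16)  # from a cryptographic random number generator
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                     salt=salt, iterations=600_000)  # count varies by system
    key = base64.urlsafe_b64encode(kdf.derive(password))
    return salt + Fernet(key).encrypt(plaintext)
```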
- asymmetric encryption methods might be employed in some embodiments of module 324.
- Asymmetric encryption, such as public-key cryptography, may involve the use of a pair of keys: a private key, which remains confidential, and a public key, which may be shared openly. Data encrypted with the public key can only be decrypted with the corresponding private key, and vice versa.
- the RSA encryption algorithm is utilized for asymmetric encryption, although other algorithms like Elliptic Curve Cryptography (ECC) or Diffie-Hellman might also be considered.
- the RSA key pair can be generated with varying key sizes, depending on the desired security level. While a 2048-bit key size might be used in many scenarios, larger key sizes like 3072-bit or 4096-bit may be selected for heightened security environments.
- Data encrypted for transit might employ specific padding schemes.
- In some embodiments, the OAEP (Optimal Asymmetric Encryption Padding) scheme is used with a mask generation function such as MGF1 and a hash function like SHA256; other padding schemes and hash functions may be employed based on specific requirements.
- the module 324 might provide functionalities to serialize the generated RSA keys into standard formats, such as PEM, for storage or transmission purposes.
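- For illustration, a sketch of RSA key generation, OAEP encryption with MGF1 and SHA256, and PEM serialization using the Python cryptography package (the payload and passphrase are stand-ins):

```python
# RSA key generation, OAEP (MGF1/SHA256) encryption, and PEM serialization.
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = private_key.public_key().encrypt(b"sensitive payload", oaep)
assert private_key.decrypt(ciphertext, oaep) == b"sensitive payload"

# Serialize the private key to PEM, encrypted under a passphrase.
pem = private_key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.BestAvailableEncryption(b"passphrase"),
)
```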
- the private key, due to its sensitive nature, may be stored securely using hardware security modules, encrypted filesystems, or other secure storage mechanisms. Additionally, in some embodiments, the module 324 may include features to enhance the encryption process, such as Message Authentication Codes (MACs) to verify data integrity.
- the system might offer ways to rotate keys regularly, ensuring that even if a key is compromised, its window of vulnerability is limited.
- robust logging and alerting mechanisms may be integrated to notify system administrators or users of any encryption-related anomalies.
- Some embodiments include audit and compliance module 326, which in some cases may keep track of who accesses what data and when, for instance with access logs in the database 308. Some embodiments may assist in ensuring that the marketplace adheres to relevant regulations and standards like General Data Protection Regulation (GDPR) or Health Insurance Portability and Accountability Act (HIPAA).
- the system may include audit records and a scheduled process responsible for interfacing with the underlying database 308 and performing compliance checks.
- the audit record may comprise a user identifier, an action, a data set identifier, and a timestamp.
- the user identifier can be any form of unique identification associated with a user in the data marketplace.
- the action may be a representation of any action performed by the user, such as accessing a data set, modifying a data set, or any other relevant activities.
- the data set identifier can be an identification number or reference associated with a specific data set in the marketplace.
- the timestamp, in some embodiments, may capture the exact time when the action was performed.
- Retrieval of audit records can be general or filtered based on certain attributes like a specific user identifier.
- the retrieval functionality can be extended to include more complex query operations, leveraging the capabilities of the underlying database system.
- a compliance check function may also be provided. In some embodiments, this function checks the actions of a user against certain compliance criteria. For instance, it may ensure that a user has not exceeded the number of accesses for a particular data set.
- the criteria for compliance checks can be highly configurable, and in other embodiments, they may involve checking against regional data access regulations, adherence to user or data set-specific access controls, or any other relevant checks.
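- A minimal sketch of such an audit record and an access-count compliance check follows; the record fields track the description above, while the check itself is just one illustrative criterion:

```python
# Audit record plus a configurable compliance check over the access log.
import time
from dataclasses import dataclass, field


@dataclass
class AuditRecord:
    user_id: str
    action: str
    data_set_id: str
    timestamp: float = field(default_factory=time.time)


def within_access_limit(log: list[AuditRecord], user_id: str,
                        data_set_id: str, max_accesses: int) -> bool:
    """True if the user has not exceeded the allowed accesses for a data set."""
    count = sum(r.user_id == user_id and r.data_set_id == data_set_id
                and r.action == "access" for r in log)
    return count <= max_accesses
```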
- Some embodiments may include feedback and ratings module 328.
- This module may allow users to rate and review data sets, associating those forms of feedback with the data sets in database 308. Some embodiments may help users assess a data set's value and quality. Some embodiments may allow data providers to receive feedback to improve their offerings as well.
- the module 328 may be configured to allow users to submit feedback about data sets, comprising both a numerical rating and a textual comment. Furthermore, the system may be designed to retrieve and display such feedback to other users to aid in their evaluation of data sets.
- the module 328 includes a portion of database 308 configured to store feedback entries. Each feedback entry may include an identifier for the data set, an identifier for the user submitting the feedback, a numerical rating, and a textual comment.
- the system may provide an HTTP endpoint configured to receive POST requests for submitting feedback. When a POST request is received at this endpoint, the system may extract details of the feedback from the request, create a new feedback entry in the database, and respond with a confirmation. The details extracted from the request may include, but are not limited to, the data set identifier, user identifier, rating, and comment.
- the system may provide an HTTP endpoint configured to handle GET requests to retrieve feedback for a specific data set. Upon receiving a GET request at this endpoint, the system might query the database for all feedback entries associated with the specified data set and return the results.
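- A hedged sketch of these two endpoints using Flask, with an in-memory list standing in for database 308 and illustrative route names:

```python
# POST endpoint to submit feedback and GET endpoint to retrieve it.
from flask import Flask, jsonify, request

app = Flask(__name__)
feedback: list[dict] = []


@app.post("/feedback")
def submit_feedback():
    # Extract the feedback details from the JSON request body.
    entry = {k: request.json[k]
             for k in ("data_set_id", "user_id", "rating", "comment")}
    feedback.append(entry)
    return jsonify({"status": "received"}), 201


@app.get("/feedback/<data_set_id>")
def get_feedback(data_set_id: str):
    return jsonify([e for e in feedback if e["data_set_id"] == data_set_id])
```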
- user authentication mechanisms may be integrated, e.g., with module 310 above.
- Before submitting or retrieving feedback, users might be required to authenticate themselves.
- the authentication can be based on various methods, such as username/password combinations, OAuth integration, multi-factor authentication, or other authentication methods.
- the system may include a mechanism for verifying the authenticity of feedback. For instance, after submitting feedback, a user might receive an email prompting them to confirm their submission. Only after receiving such confirmation might the feedback be displayed to other users.
- the system might offer features for users to edit or delete their previously submitted feedback.
- feedback entries may be tagged with categories, topics, sentiment, or other labels, indicating specific attributes or characteristics of the feedback, such as “positive,” “negative,” “technical issue,” or “data inconsistency.”
- Some embodiments include notification system module 330.
- This module 330 may establish a table or other collection of notification records, which may contain multiple fields or columns such as id, user_id, message, and status.
- the id may serve as a unique identifier for each notification, ensuring individual addressing.
- the user_id may be used to identify the intended recipient of the notification.
- the message field may store the actual content or body of the notification, while the status field can indicate whether the notification has been read or remains unread by the user.
- when a new notification is to be generated, the system receives a POST HTTP request.
- This request may include data such as the user's identifier and the content of the message to be delivered.
- the system can verify the data, ensuring the presence of necessary information.
- the system may insert the new notification into the database, assigning it a unique identifier and marking its initial status as 'unread'.
- Users or recipients can retrieve their notifications by sending a GET HTTP request via device 307, specifying their unique user identifier.
- the module 330 may query the database 308, extracting all relevant notifications associated with that user. These notifications can be sent back to the requester in a structured format, such as JSON.
- users have the capability to mark a specific notification as 'read'. This may be accomplished by sending a PUT HTTP request to the system, specifying the unique identifier of the notification in question. Upon receipt of such a request, the system updates the status of the specified notification in the database 308.
- module 330 may include the following features: authentication and authorization mechanisms may be incorporated to ensure that users can only access and modify their own notifications; in some embodiments, broadcast notifications could be implemented, allowing a single message to be sent to multiple or all users; module 330 might support categorization or tagging of notifications, enabling users to filter and sort their messages based on topics or urgency; extending beyond text, notifications could support rich media, such as images, audio, video, or interactive content; to optimize user engagement, module 330 may incorporate machine learning algorithms or heuristics to determine the best time to deliver notifications; in some embodiments, module 330 could offer integration points with other platforms or services through APIs, allowing automated generation, delivery, or processing of notifications; and features like batch processing, where multiple notifications are processed as a group, may be added to enhance module 330 efficiency.
- a request to module 332 may include data set details in the body, e.g., in JSON format, although other formats like XML could also be supported.
- the module 332 might save the data to a persistent storage system in database 308.
- other operations like data validation, transformation, or enrichment, may be applied before saving.
- module 332 includes the following: user management endpoints could be introduced to allow for the creation, modification, or deletion of user profiles; endpoints for feedback and ratings may be developed, where users can submit feedback for data sets or rate them based on their quality and relevance; a notification system might be facilitated through specific endpoints, where external services can push notifications to be delivered to users, or users can fetch any notifications relevant to them; search endpoints can be implemented, allowing users or third-party services to search for data sets based on various criteria, like keywords, tags, or categories; pagination and filtering features may be added to list-based endpoints to allow users to retrieve a subset of results based on specific criteria or to navigate large result sets.
- Some embodiments include admin and moderation tools module 334, which may be used by marketplace staff to manage users, data sets, and system health. It may include tools for handling disputes, verifying data set quality, or removing inappropriate content.
- the module 334 may include a component that provides an interface, which in some embodiments could be a web-based dashboard, for administrators to view all data sets submitted by users. This interface may present data sets in a list, grid, or other formats and may provide search, filter, or sorting capabilities to help administrators quickly locate specific data sets.
- the module 334 offers a verification mechanism. Through this mechanism, administrators can approve or verify data sets to vouch for their credibility or authenticity.
- Some embodiments may use a binary system (verified or not verified), or other embodiments might include multiple levels of verification, labels, or badges to indicate varying levels of trustworthiness or quality.
- the module 334 may also include functionality for marking data sets as containing inappropriate content. This feature may allow administrators to flag or hide data sets that do not comply with the platform's policies or standards. In some embodiments, once a data set is marked as inappropriate, it may be hidden from user view, or a warning could be displayed. In other embodiments, the data set might be entirely removed from the system. To further enhance the moderation process, the module 334 may offer a dispute handling mechanism. In the event of disagreements or conflicts between data providers and consumers, administrators can intervene and make decisions. Some embodiments may provide more sophisticated workflows, such as recording reasons for decisions, sending out notifications to involved parties, or integrating with third-party mediation services.
- Dispute records may be associated with data sets, feedback, or the like and viewable by users in some cases.
- these records and workflows could be extended to handle user reports, reviews, or feedback, allowing administrators to address concerns, resolve issues, or gather insights for platform improvements.
- some embodiments may support community-based moderation, where users or forum administrators participate in these workflows.
- module 334 may include the following features: In some embodiments, administrators may have the capability to select multiple data sets and perform bulk actions, such as verifying, flagging, or deleting them in one go; The system might record administrator actions, such as changes made, data sets viewed, or disputes handled, offering transparency and traceability; In certain embodiments, real-time notifications could alert administrators to urgent issues, new data set submissions, or significant platform events; An analytics dashboard could provide administrators insights into user behavior, popular data sets, and other platform metrics; The system may include tools to collect feedback from users on flagged data sets, offering administrators a more comprehensive view of potential issues.
- Some embodiments include infrastructure management module 336. This module 336 may manage server health, scalability, and resource allocation.
- a method may involve regularly monitoring the health or performance metrics of one or more servers that are part of an online platform's infrastructure. These servers may be physical or virtual machines, and they might be located in a single data center or distributed across multiple geographic locations. One way to ascertain the health or performance metrics of a server is by sending HTTP requests to an API endpoint provided by a cloud service provider or an infrastructure management platform.
- the module 336 may communicate with services such as AWS™, GCP™, Azure™, or any other cloud or on-premise infrastructure provider to retrieve these metrics.
- when a certain metric, such as CPU (central processing unit) or memory usage, crosses a threshold, the module 336 may take an action to adjust the resources allocated to that server. This action may involve increasing the computational power, memory, or storage of the server, or replicating that server and load-balancing between instances. For instance, if the CPU usage of a server exceeds 80%, the system may choose to scale up the resources for that server. On the other hand, if the CPU usage drops below a certain threshold, say 20%, the system may scale down the resources or terminate an instance of a server.
- the exact thresholds can vary
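- A small sketch of such threshold-based scaling, where scale_up and scale_down stand in for calls to a cloud provider's API and the 80%/20% thresholds mirror the example above:

```python
# Threshold-based scaling decision; provider API calls are stubbed out.
from typing import Callable


def autoscale(cpu_usage: float, instances: int,
              scale_up: Callable[[], None], scale_down: Callable[[], None],
              high: float = 0.80, low: float = 0.20) -> int:
    if cpu_usage > high:
        scale_up()
        return instances + 1
    if cpu_usage < low and instances > 1:  # keep at least one instance
        scale_down()
        return instances - 1
    return instances
```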
- module 336 may include other features, such as the following: A failover mechanism that redirects traffic or operations from a failing server to a healthy one; Integration with third-party monitoring tools, such as Datadog™, New Relic™, or Grafana™, which might offer deeper insights or visualizations of the performance data; A logging mechanism that records all scaling actions, reasons for these actions, and the state of the system before and after the action; A backup system that periodically creates backups of data and configurations, ensuring data integrity and allowing for recovery in case of failures; A disaster recovery plan that defines steps to restore normal operations in catastrophic scenarios, such as data center outages; and the module 336, in some embodiments, may provide an interface for administrators to manually override scaling decisions or to set specific scaling policies based on expected events, such as planned maintenance or anticipated traffic spikes due to promotions or events.
- FIG. 2 is a flow chart showing an example of a process 500 by which data may be discovered.
- the process 500 may be performed by the module 316 described above, or in some cases the process may be performed by other components.
- the steps of the process 500 may be performed in a different order from that illustrated: additional steps may be inserted; some steps may be repeated more times than others in a given iteration; and some steps may be performed concurrently with one another, or in some cases the steps may be performed serially, none of which is to suggest that any other description herein is limiting.
- the process like the other processes herein, may be implemented by computer-readable instructions stored on a tangible, non-transitory medium, such that when the instructions are executed by a computer system, the described functionality is effectuated.
- the process 500 may include obtaining a plurality of data sets from a plurality of different users of a data exchange as indicated by block 510.
- Some users (which may be human or other computer processes) may only write data sets, some may only read data sets, and some may do both.
- the data sets may also be referred to as data assets.
- the data sets may be obtained by different users uploading their data to the data exchange over time to share that data with other users of the data exchange.
- the data sets may be obtained from third-party computer systems that push data or from which data is pulled to constitute the data sets, examples including census data hosted by the US Government, computer logs, measurement logs of industrial process control systems, results of clinical studies, business data, and the like.
- the data exchange is the above-described server system 300 or the data exchange may take other forms.
- the data exchange provides a forum in which data sets can be shared, revised, commented upon, transformed, explored, analyzed, and otherwise used by a community of users.
- the users are from the same organization or in some cases the users are part of a broader community.
- access to different data sets may be limited by roles and permissions as described above.
- Data sets may each take a variety of different forms. Examples include tabular data, unstructured data, natural language text, semi-structured data, binary blobs, and the like.
- the data sets have a diversity of different schemas.
- the data sets take the form of relational database exports having a plurality of tables with key values by which different tables may be joined, for example, in third normal form.
- the data sets are semi-structured or unstructured documents such as JSON or XML files.
- the data sets are comma separated value (CSV) files.
- the data sets are key-value store exports, such as dictionaries.
- the data sets include a plurality of tables in each of at least some of the data sets, such as more than 5 or more than 10 tables in some subset of the data sets.
- some of the data sets may be relatively large, for example, bigger than 500 megabytes, 1 terabyte, 2 terabytes, or larger.
- the data sets may include both numeric and alphanumeric data in some cases.
- data in the data set is human-interpretable without passing through a codec, e.g., an image file itself typically would not be a data set, though data sets may include images, audio, video, or the like.
- In the query completion interface 700 of Figure 4, candidate 706 has an icon 704 indicating one source for this proposed completion within the database 308, while candidate 710 includes a different icon 708 indicating a different source for this candidate.
- different sources having different icons may include previous queries or comments by the user, previous queries or comments by others, dataset titles, table titles, dataset field names, dataset values, and the like.
- some embodiments may, before a query begins to be entered, build and periodically refresh an index of proposed query completions with associated icon identifiers indicating the source of the candidate completions to facilitate relatively quick creation and evolution of the user interface 700, which in some cases may update with each additional character typed to reflect the narrowing of potential list of candidates.
- a database lookup may be performed using the processed query to search a database or multiple databases of possible completions.
- This database may comprise popular queries, the user's history for personalized suggestions, and real-time data like trending searches or news.
- the server, in some embodiments, then compiles a list of suggestions based on the query, employing ranking algorithms to determine the most relevant completions. These suggestions, in some embodiments, are sent back to the client in a lightweight format such as JavaScript Object Notation for parsing by the browser.
- the server may implement caching mechanisms for frequent queries and their completions to improve response time for common searches.
- Additional features that developers might choose to add to such systems include personalized suggestion mechanisms based on user's past search history, integration with social media trends for real-time suggestion updates, voice recognition capabilities for hands-free typing, multilingual support for global user accessibility, and machine learning models that adapt to user behavior over time for more accurate predictions.
- load balancing may be implemented in server infrastructure to distribute the query load across multiple servers, ensuring efficient handling of a high volume of concurrent requests.
- Some embodiments store candidate queries in a data structure known as a trie, or prefix tree, to store and retrieve a set of words or phrases efficiently.
- the system may comprise a server-side component and a client-side component, each playing a distinct role in facilitating query completion.
- the server includes a trie data structure, which is characterized by nodes and edges where each node represents a character of the alphabet and each path from the root node to a leaf node represents a word or phrase stored in the trie.
- the trie may be capable of inserting new words or phrases, which involves creating new nodes for each character in the word or phrase that is not already present in the trie, and marking the end of the word or phrase on the final node.
- the server 300 may include an API endpoint that accepts query prefixes as input and returns a list of suggestions based on the contents of the trie.
- This API endpoint may be designed to handle HTTP GET requests, where the query prefix is passed as a query parameter.
- the server processes this request by searching the trie for words or phrases that start with the given prefix, which may involve traversing the trie from the root node and following the paths that match the characters in the prefix.
- the server collects the words or phrases that match the prefix and sends them back as a response to the client.
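- A minimal trie sketch matching the insert-and-traverse behavior described above (the weight assignments and fuzzy matching mentioned below are omitted):

```python
# Minimal trie for prefix-based query completion: insert creates a node per
# character, and completion walks to the prefix's node and collects phrases.
class TrieNode:
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.is_end = False  # marks the final node of a stored phrase


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase: str) -> None:
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def complete(self, prefix: str) -> list[str]:
        node = self.root
        for ch in prefix:  # follow the path matching the prefix
            if ch not in node.children:
                return []
            node = node.children[ch]
        suggestions: list[str] = []

        def collect(n: "TrieNode", acc: str) -> None:
            if n.is_end:
                suggestions.append(prefix + acc)
            for ch, child in n.children.items():
                collect(child, acc + ch)

        collect(node, "")
        return suggestions
```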
- the system may employ JavaScript™ to capture user input and interact with the server-side API.
- the JavaScript™ code may send the current input value to the server's API endpoint, using asynchronous web requests, and then display the suggestions returned by the server.
- the system may be extended or modified in various ways.
- the trie data structure on the server side may be augmented with additional features such as weight assignments to prioritize certain words or phrases, or the ability to handle fuzzy matching to account for typographical errors in the user input.
- the server component may also be designed to handle more complex queries or to integrate with databases or external APIs to retrieve the words or phrases for the trie.
- Figure 7 shows an example of a dataset detail history page 740 that may be shown on client computing devices as they engage with the server system 300 discussed above.
- the user interface includes a data history transaction list 742 with a plurality of transactions 744 in which the dataset at issue was changed.
- Each transaction 744 may include identifiers of those who participated in the change, a number of changes in the transaction, and a summary of what was changed in the dataset, along with an indication of when the change was made.
- Figure 8 shows an example of a table detail landing page 750 that may be shown on client computing devices responsive to selection of a table.
- the user interface 750 includes a description of the table 752, a list of datasets of which it is a part, and the table itself 754, which may include headings 756 identifying field names and values 758 for instances of those fields.
- Figure 10 is a diagram that illustrates an exemplary computing system 1000 in accordance with embodiments of the present technique.
- Various portions of systems and methods described herein may include or be executed on one or more computer systems similar to computing system 1000. Further, processes and modules described herein may be executed by one or more processing systems similar to that of computing system 1000.
- a processor may execute code (e.g., processor firmware, a protocol stack, a database management system, an operating system, or a combination thereof) that creates an execution environment for program instructions.
- a processor may include a programmable processor.
- a processor may include general or special purpose microprocessors.
- a processor may receive instructions and data from a memory (e.g., system memory 1020).
- Computing system 1000 may be a uniprocessor system including one processor (e.g., processor 1010a), or a multi-processor system including any number of suitable processors (e.g., 1010a-1010n). Multiple processors may be employed to provide for parallel or sequential execution of one or more portions of the techniques described herein.
- Processes, such as logic flows, described herein may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating corresponding output. Processes described herein may be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- Computing system 1000 may include a plurality of computing devices (e.g., distributed computer systems) to implement various processing functions.
- I/O device interface 1030 may provide an interface for connection of one or more I/O devices 1060 to computer system 1000.
- I/O devices may include devices that receive input (e.g., from a user) or output information (e.g., to a user).
- I/O devices 1060 may include, for example, graphical user interfaces presented on displays (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor), pointing devices (e.g., a computer mouse or trackball), keyboards, keypads, touchpads, scanning devices, voice recognition devices, gesture recognition devices, printers, audio speakers, microphones, cameras, or the like.
- I/O devices 1060 may be connected to computer system 1000 through a wired or wireless connection.
- I/O devices 1060 may be connected to computer system 1000 from a remote location.
- I/O devices 1060 located on a remote computer system, for example, may be connected to computer system 1000 via a network and network interface 1040.
- Network interface 1040 may include a network adapter that provides for connection of computer system 1000 to a network.
- Network interface 1040 may facilitate data exchange between computer system 1000 and other devices connected to the network.
- Network interface 1040 may support wired or wireless communication.
- the network may include an electronic communication network, such as the Internet, a local area network (LAN), a wide area network (WAN), a cellular communications network, or the like.
- System memory 1020 may be configured to store program instructions 1100 or data 1110.
- Program instructions 1100 may be executable by a processor (e.g., one or more of processors 1010a-1010n) to implement one or more embodiments of the present techniques.
- Instructions 1100 may include modules of computer program instructions for implementing one or more techniques described herein with regard to various processing modules.
- Program instructions may include a computer program (which in certain forms is known as a program, software, software application, script, or code).
- a computer program may be written in a programming language, including compiled or interpreted languages, or declarative or procedural languages.
- a computer program may include a unit suitable for use in a computing environment, including as a stand-alone program, a module, a component, or a subroutine.
- a computer program may or may not correspond to a file in a file system.
- a program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
- a computer program may be deployed to be executed on one or more computer processors located locally at one site or distributed across multiple remote sites and interconnected by a communication network.
- Embodiments of the techniques described herein may be implemented using a single instance of computer system 1000 or multiple computer systems 1000 configured to host different portions or instances of embodiments. Multiple computer systems 1000 may provide for parallel or sequential processing/execution of one or more portions of the techniques described herein.
- Computer system 1000 is merely illustrative and is not intended to limit the scope of the techniques described herein.
- Computer system 1000 may include any combination of devices or software that may perform or otherwise provide for the performance of the techniques described herein.
- computer system 1000 may include or be a combination of a cloud-computing system, a data center, a server rack, a server, a virtual server, a desktop computer, a laptop computer, a tablet computer, a server device, a client device, a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a vehicle-mounted computer, a Global Positioning System (GPS) device, or the like.
- Computer system 1000 may also be connected to other devices that are not illustrated, or may operate as a stand-alone system.
- the functionality provided by the illustrated components may in some embodiments be combined in fewer components or distributed in additional components.
- the functionality of some of the illustrated components may not be provided or other additional functionality may be available.
- instructions stored on a computer-accessible medium separate from computer system 1000 may be transmitted to computer system 1000 via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network or a wireless link.
- Various embodiments may further include receiving, sending, or storing instructions or data implemented in accordance with the foregoing description upon a computer-accessible medium. Accordingly, the present techniques may be practiced with other computer system configurations.
- conditional relationships in which the antecedent is a necessary causal condition, the antecedent is a sufficient causal condition, or the antecedent is a contributory causal condition of the consequent, e.g., “state X occurs upon condition Y obtaining” is generic to “X occurs solely upon Y” and “X occurs upon Y and Z.”
- conditional relationships are not limited to consequences that instantly follow the antecedent obtaining, as some consequences may be delayed, and in conditional statements, antecedents are connected to their consequents, e.g., the antecedent is relevant to the likelihood of the consequent occurring.
- statements that one value or action is “based on” another condition or value encompass both instances in which the condition or value is the sole factor and instances in which the condition or value is one factor among a plurality of factors.
- statements that “each” instance of some collection have some property should not be read to exclude cases where some otherwise identical or similar members of a larger collection do not have the property, i.e., each does not necessarily mean each and every.
- a computer-implemented method comprising: obtaining, with a computer system hosting a data exchange, a plurality of data sets from a plurality of different users of the data exchange; receiving, with the computer system, a query to search among the plurality of data sets, the query specifying a token; determining, with the computer system, a group of data sets among the plurality of data sets that is responsive to the query, wherein determining the group of data sets comprises checking titles, tables, field names, and cell values of data sets among the plurality of data sets for use of the token; determining, with the computer system, a ranking of members of the group of data sets responsive to the query; and sending, with the computer system, in response to the query, summaries of at least some members of the group of data sets with instructions to present the summaries according to the ranking.
- any one of embodiments 1-3 comprising: for a given data set among the at least some members of the group of data sets, before receiving the query, determining a quality score of the given data set; and pre-computing summary statistics or visualizations of the given data set by: precomputing a first histogram of a first field of the given data set; precomputing a first measure of central tendency of the first field of the given data set; precomputing a second histogram of a second field of the given data set; and precomputing a second measure of central tendency of the second field of the given data set.
- the area-based visualization is a tree-map in which each area of the tree-map corresponds to a data set in the at least some members of the group, and wherein a size of each area corresponds to an amount of occurrences of terms of the query in the corresponding data set.
- determining the group of data sets responsive to the query comprises searching for tables responsive to the query across the plurality of data sets.
- any one of embodiments 1-14 comprising: receiving, with the computer system, from a first user, a selection of a first data set among the plurality of data sets, the first data set having a plurality of versions; receiving, with the computer system, a message from the first user to be shared with a second user in association with the first data set regarding collaboration between the first user and the second user on the first data set; pre-computing, with the computer system, summary statistics or visualizations for each of a plurality of fields of the first data set; causing, with the computer system, the message and the summary statistics or visualizations to be presented to the second user in association with the first data set; obtaining, with the computer system, another version of the first data set that has undergone revision and logging transformations to the first data set and comments on the transformations to form the another version in a log of changes associated with the first data set; and causing, with the computer system, a listing of versions of the first data set and at least some of the logged changes to be presented to the second user.
- the log indicates: an identifier of a user who created the first data set; an identifier of each user who modified the first data set in the versions of the first data set; what changes each user who modified the first data set made; and comments by at least some of the users who modified the first data set explaining their changes.
- determining the group of data sets comprises steps for searching a data marketplace; and determining the ranking comprises steps for ranking search results.
- any one of embodiments 1-18 comprising: receiving, from a user, a request to access a selected data set among the at least some members of the group of data sets; determining, based on a policy mapping permissions to the user or role of the user, that the user is not permitted to access a first subset of the selected data set and that the user is permitted to access a second subset of the selected data set; and in response to the determinations regarding access, masking the first subset of the selected data and sending the masked first subset of the selected data and the second subset of the selected data to the user.
- a tangible, non-transitory, machine-readable medium storing instructions that when executed by a data processing apparatus cause the data processing apparatus to perform the operations of any one of embodiments 1-19.
- a system comprising: one or more processors; and memory storing instructions that when executed by the processors cause the processors to effectuate the operations of any one of embodiments 1-19.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Provided is a process including: obtaining, with a computer system hosting a data exchange, a plurality of data sets from a plurality of different users of the data exchange; receiving, with the computer system, a query to search among the plurality of data sets, the query specifying a token; determining a group of data sets among the plurality of data sets that is responsive to the query, wherein determining the group of data sets comprises checking titles, tables, field names, and cell values of data sets among the plurality of data sets for use of the token; determining a ranking of members of the group of data sets responsive to the query; and sending in response to the query, summaries of at least some members of the group of data sets with instructions to present the summaries according to the ranking.
Description
DATA SET DISCOVERY IN DATA EXCHANGES
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application is related to US Provisional Patent Application No. 63/574,675, filed in the name of CSL Behring L.L.C., on 4 April 2023, the originally filed specification of which is hereby incorporated by reference herein in its entirety.
TECHNICAL FIELD
[0002] The present disclosure relates generally to methods, processes and systems for sharing data sets and, more specifically, to data set discovery in data exchanges.
BACKGROUND
[0003] A data marketplace is a platform where data providers can list and share their data sets, and data consumers can browse and download (in some cases upon purchase of a license or subject to an open-source license) these data sets for various purposes, such as analysis, machine learning, or business intelligence. The diversity and number of data sets in such platforms can be expansive, making it difficult for users to find, evaluate, and trust data sets that are available. Such systems are expected to become more important as more data becomes better with improved sensors, collection, and reporting; and as data use cases expand through machine learning, better statistical analysis tools, and more disciplined data- driven decision-making practices are adopted.
[0004] It is desired to address or ameliorate one or more disadvantages or limitations associated with the prior art, or to at least provide a useful alternative.
SUMMARY
[0005] The following is a non-exhaustive listing of some aspects of the present disclosure. These and other aspects are described in the following disclosure.
[0006] Some aspects include a computer-implemented method or process, including: obtaining, with a computer system hosting a data exchange, a plurality of data sets from a plurality of different users of the data exchange; receiving, with the computer system, a query to search among the plurality of data sets, the query specifying a token; determining, with the computer system, a group of data sets among the plurality of data sets that is responsive to the query, wherein determining the group of data sets comprises checking titles, tables, field names, and cell values of data sets among the plurality of data sets for use of the token; determining, with the computer system, a ranking of members of the group of data sets responsive to the query; and sending, with the computer system, in response to the query, summaries of at least some members of the group of data sets with instructions to present the summaries according to the ranking.
[0007] Some aspects include a tangible, non-transitory, machine-readable medium storing instructions that when executed by a data processing apparatus cause the data processing apparatus to perform operations including the above-mentioned process.
[0008] Some aspects include a system, including: one or more processors; and memory storing instructions that when executed by the processors cause the processors to effectuate operations of the above-mentioned process.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] Some embodiments of the present invention are hereinafter described, by way of example only, with reference to the accompanying drawings, in which like numbers indicate similar or identical elements:
[0010] Figure 1 illustrates a logical and physical architecture of a computing environment in which a data exchange may be implemented and used in accordance with some embodiments of the present techniques.
[0011] Figure 2 is an example of a process that may be executed by the system of figure 1 to facilitate data discovery in accordance with some embodiments of the present techniques.
[0012] Figure 3 illustrates an example of a process that may be executed by the system of figure 1 to provide a social graph overlay on information in a data exchange in accordance with some embodiments of the present techniques.
[0013] Figure 4 is an example user interface showing query completion in accordance with some embodiments of the present techniques.
[0014] Figure 5 is a user interface showing an example of a default home page in accordance with some embodiments of the present techniques.
[0015] Figure 6 is an example of a dataset detail landing page in accordance with some embodiments of the present techniques.
[0016] Figure 7 is an example of a dataset detail history page in accordance with some embodiments of the present techniques.
[0017] Figure 8 is an example of a table detail landing page in accordance with some embodiments of the present techniques.
[0018] Figure 9 is an example of a search results page in accordance with some embodiments of the present techniques.
[0019] Figure 10 is an example of a computing device by which computing systems may be composed to implement the techniques described herein.
[0020] While the present techniques are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. The drawings may not be to scale. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the present techniques to the particular form disclosed, but to the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present techniques as defined by the appended claims.
DETAILED DESCRIPTION
[0021] Disclosed herein are solutions and, in some cases just as importantly, recognition of problems that had been previously overlooked (or not yet foreseen) by others, including in the fields of Computer Science and Human-Computer Interaction Design. Indeed, the inventors wish to emphasize the difficulty of recognizing those problems that are nascent and will become much more apparent in the future should trends in industry continue as the inventors expect. Further, because multiple problems are addressed, it should be understood that some
embodiments are problem-specific, and not all embodiments address every problem with traditional systems described herein or provide every benefit described herein. That said, improvements that solve various permutations of these problems are described below.
[0022] Some embodiments include a software as a service (SaaS) data marketplace, or other form of data exchange, with improved discovery of data sets for users. Some embodiments support searching for data sets by table across data sets with ranking of search results that accounts for absent data often plaguing lower-quality data sets. Some embodiments also account for downstream applications pulling data from these tables when forming the rankings, up-ranking those data sets and tables with greater usage. In some cases, search results are presented in a visually-compelling, area-based visualization like a tree map that supports zooming and panning to quickly and smoothly explore search results, e.g., in a web-browser or native application. Embodiments in some cases also surface frequency of query terms in the data sets to facilitate exploration of search results and implement recommendation systems to personalize recommendation of data sets for users. These aspects are described below with reference to Figure 2, after aspects are described of an example system in which they may be implemented with reference to Figure 1.
[0023] In some embodiments, these or other techniques may be implemented in a computing environment shown in Figure 1. The computing environment may include a server system 300, a third-party server 309, a client computing device 307, and the internet 305.
[0024] In some embodiments, the server system 300 may include a controller 302, a web server 304, an API (application program interface) server 306, and a database 308, along with the following: user authentication and authorization module 310; user profile management module 312; data upload and storage module 314; data catalog and discovery module 316; data quality and metadata management module 318; pricing and billing module 320; analysis tools integration module 322; data security and encryption module 324; audit and compliance module 326; feedback and ratings module 328; notification system module 330; APIs and integration endpoints module 332; administration and moderation tools module 334; and infrastructure management module 336. The controller 302 may coordinate the operation of the other listed components, e.g., by assigning tasks, sending data, and routing results, instructions, and data for user interfaces through the web server 304 or API server 306, which, in some cases, may be non-blocking servers configured to support concurrent sessions with a plurality of remote computing devices.
[0025] In some embodiments, the system 300 may use a software framework or platform such as Flask™, Django™, Express™, or Ruby on Rails™; any suitable software development framework may be employed to achieve analogous functionalities. The system 300, in certain implementations, may utilize a database interaction module or solution; for example, SQLAlchemy™, MySQL™, PostgreSQL™, MongoDB™, Oracle™, or other database management systems might be employed to store, retrieve, and manage user profile data in database 308.
[0026] Some embodiments may include a user authentication and authorization module 310. In some embodiments this module 310 ensures (e.g., verifies or increases the likelihood) that users are who they claim to be. This may involve username and password combinations, OAuth integration with third-party services, or multi-factor authentication, e.g., with passkeys, webauthn, FIDO 2, or other suitable protocols. Once a user is authenticated, in some embodiments, module 310 may also determine what actions the user is permitted to perform, like uploading data, purchasing access, etc. A user model, in some embodiments, may be defined with various attributes. One of the attributes may include an is_admin attribute or an equivalent flag to distinguish regular users from administrative users. However, other attributes or methods of distinction may be utilized, such as roles, groups, or permissions levels. In some cases, the system 300 supports multiple tenants (e.g., businesses or organizations), and tenant-specific accounts may support roles and permissions specific to their account, e.g., limiting certain forms of access (for instance, read or write) to only their employees or only those employees having a certain role.
[0027] For the process of user login, in some embodiments, the module 310 may accept credentials, which may include a username and password entered and submitted via a log-in interface on device 307 or submitted with an API request from system 309. The provided credentials may be checked against a database, like database 308. In some embodiments, the password, rather than being stored in plain text, may be hashed. Other mechanisms, such as token-based authentication, biometric verification, or two-factor authentication, may be implemented as alternatives or in addition to password checking.
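By way of a non-limiting illustration, the hashed-password check described above might be sketched in Python using only the standard library; the iteration count and function names here are assumptions for illustration, not requirements of the present techniques:

```
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes | None = None, iterations: int = 310_000):
    # Derive a PBKDF2 digest so the password need not be stored in plain text.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes, iterations: int = 310_000) -> bool:
    # Recompute the digest from the submitted credentials and compare in constant time.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, stored)
```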
[0028] Once authenticated, users may be directed to a dashboard, search interface, or similar interface like those discussed below. In some embodiments, the module 310 may employ authorization mechanisms to distinguish the functionalities available to regular users versus administrative users based on the aforementioned is_admin attribute or other determining
factors. While a dashboard is one possible interface, other embodiments may use landing pages, portals, or other user interfaces tailored to the authenticated user's permissions. For user logout, the module 310 in some embodiments may invalidate the user's session, although other mechanisms, such as token expiration or user-driven session termination, might also be employed.
[0029] Some embodiments may include user profile management module 312. This module 312 may allow users to create, update, and manage their profile information. Some embodiments may also allow users to set preferences, payment details, and other aspects of their profile, along with tenant employer and role therein. In some embodiments, the user profile data may encompass various attributes. A representative attribute set might be a username, an email, and user preferences, among others. However, other attributes may be included, such as profile images, user histories, associated devices, or any other relevant user-specific data. For simplicity, the user preferences in one embodiment might be stored as a string formatted in JSON (JavaScript™ object notation). In other embodiments, the preferences could be stored in structured tables, XML (extensible markup language) strings, or other data formats suitable for capturing user preferences and settings.
[0030] The module 312 may offer an interface or portal for the user to view and modify their profile details. In one embodiment, this interface may be web-based, accessible through standard web browsers on device 307. However, in other embodiments, it could be accessed through dedicated applications on devices like smartphones, tablets, desktop applications, or other computing devices. To modify profile details, in some embodiments, a user may, via device 307, provide input through form fields or equivalent input methods, and then submit the data for processing. Once submitted, the system may validate, sanitize, and save the updated profile data to the database. The precise method of capturing and processing user input may vary, including interactive forms, voice commands, touch gestures, or any other suitable input mechanism.
[0031] Module 312 may include the following: validation and sanitation mechanisms to ensure the integrity and security of user data; multi-faceted user preferences handling for more granular user settings; integration with third-party services or platforms for extended functionalities; notification mechanisms to inform users about successful or unsuccessful profile updates; and backup and recovery functionalities to preserve and restore user profile data, in some embodiments.
[0032] Some embodiments may include a data upload and storage module 314. In some embodiments, this module 314 provides interfaces for users to upload data sets. The module 314 may also manage the storage of these data sets, in some cases ensuring redundancy and including features for versioning of data sets. To facilitate data upload, the module 314 may provide an interface, like a portal or endpoint. In some embodiments, this interface could be web-based, accessible through standard web browsers executing on device 307. However, alternative embodiments might offer interfaces accessible through dedicated applications, APIs, or endpoints compatible with different devices, such as smartphones, tablets, desktop computers, or other computing platforms, like third party server 309.
[0033] The user might be able to select and submit their data through such an interface exposed by module 314. Once submitted, the module 314, in some embodiments, may validate the provided data, ensuring appropriate format and content. After validation, the data may be stored in a predefined or dynamically determined location. The data could be stored locally on server system 300. In some embodiments, the system 300 might use cloud storage solutions, distributed databases, content delivery networks (CDNs), local cache on device 307, or other scalable storage options.
[0034] The module 314 might also maintain a record, such as a catalog, of the uploaded data sets. Each data set may be associated with metadata, potentially including a filename, date of upload, source, description, characteristic statistics (like measures of central tendency and measures of variation) computed by module 314 upon ingest, or any other relevant data attributes. The storage of such metadata can provide a means for future retrieval, organization, or analysis of the uploaded data sets. Module 314 may perform the below-described indexing operations to facilitate search operations described with reference to Figure 2 upon upload or as a batch process performed periodically.
[0035] Module 314 may also include the following: validation mechanisms to handle various data formats and ensure data integrity; integration with user authentication systems to assign data ownership and access permissions; mechanisms to handle larger data sets, such as chunking, compression, or streaming uploads; notifications or alerts to inform users about the status of their data upload; backup, redundancy, and recovery functionalities to ensure data durability and availability.
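One possible sketch of such an upload interface, assuming the Flask™ framework discussed above, with a hypothetical local directory standing in for the storage options described and an in-memory list standing in for catalog records in database 308:

```
import datetime
from pathlib import Path

from flask import Flask, jsonify, request

app = Flask(__name__)
STORAGE = Path("uploads")   # hypothetical local storage location
STORAGE.mkdir(exist_ok=True)
catalog = []                # stand-in for catalog records in database 308

@app.route("/datasets", methods=["POST"])
def upload_dataset():
    file = request.files.get("file")
    if file is None or not file.filename.endswith(".csv"):
        return jsonify(error="a CSV file is required"), 400  # basic validation
    file.save(STORAGE / file.filename)
    record = {
        "filename": file.filename,
        "uploaded": datetime.datetime.utcnow().isoformat(),
        "description": request.form.get("description", ""),
    }
    catalog.append(record)
    return jsonify(record), 201
```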
[0036] Some embodiments may include data catalog and discovery module 316. In some embodiments, module 316 may provide search functionality for users to discover data sets of interest. Some embodiments may also include a catalog system for cataloging and tagging data sets. Some embodiments may generate data previews used to provide samples of data relatively quickly before the full data set is accessed. The module 316 may perform the search operations described below with reference to Figure 2.
[0037] A collection of data, referred to herein as the “data set,” may be implemented to represent and store details about available data sets. This object may contain attributes including, but not limited to (which is not to suggest other lists are limiting), a unique identifier, a name, a description, an associated filename, and associated tags. In some embodiments, the object may include additional attributes such as upload dates, data sizes, user ownership, or other relevant metadata.
[0038] To facilitate user discovery of data sets, in some embodiments, the module 316 may provide an interface, such as instructions like HTML, JavaScript™, and the like by which a web interface is created and changed on device 307. This interface might allow users to view a list of all available data sets. Furthermore, search functionality may be incorporated, allowing users to query based on data set tags or other attributes. In some embodiments, this search capability might use string matching, but in other embodiments, other search mechanisms, like full-text search, machine learning-based recommendations, embedding vector search, or filtered searches might be employed, examples of which are described below in greater detail with reference to Figure 2. Upon selecting a data set from the list or search results, users might access detailed views of the data sets, displaying extensive metadata and possibly offering download or access links, again with greater detail below provided with reference to Figure 2.
[0039] Some embodiments of module 316 may include the following: mechanisms to handle pagination or lazy loading for displaying extensive lists of data sets; enhanced user interfaces, possibly interactive, using technologies like AJAX, WebSockets, or single page application (SPA) frameworks; integration with user authentication systems to personalize data set listings or manage access permissions; analytics to track and present data set popularity, access frequencies, or user interactions; and integration with third-party platforms or services to extend data set sourcing, storage, or accessibility.
[0040] In some embodiments, a function or other method may generate metadata for the data sets. The generation can be based on the analysis of the raw data. For example, for data sets in CSV (comma separated value) format, the metadata might include row count, column count, and column names. In other embodiments, the metadata may comprise statistics such as mean, median, mode, standard deviation, or any other descriptive statistic related to the data. Other examples include column (or field) specific visualizations, like histograms, spark lines, box plots, or the like. In some cases, data sets, tables, or sets of values of fields (e.g., columns or rows) may be submitted to the ChatGPT™ code interpreter plugin with a prompt requesting some form of data analysis, like a request to generate a heat map of the world or North America when geographic data is detected. The resulting visualizations or other forms of analysis may be included in the metadata and displayed in a similar manner as the user navigates from the search results to the data set itself, e.g., upon the user selecting the data set in a tree map, upon the user selecting a table therein, or upon the user selecting a field therein (or pairs of fields or larger collections to view relationships). In some cases, such metadata may be precomputed before receiving the query or request to navigate to a view of a user interface that displays the metadata.
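A minimal sketch of metadata generation for a CSV data set, computing the row count, column count, column names, and a per-column mean as described above (using only the Python standard library; the function name is a hypothetical choice):

```
import csv
import statistics

def generate_metadata(path):
    # Derive descriptive metadata from the raw data of a CSV data set.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = list(reader)
    metadata = {
        "column_names": header,
        "column_count": len(header),
        "row_count": len(rows),
    }
    for index, name in enumerate(header):
        try:
            values = [float(row[index]) for row in rows if row[index] != ""]
        except (ValueError, IndexError):
            continue  # skip columns that do not parse as numbers
        if values:
            metadata[f"mean_{name}"] = statistics.fmean(values)
    return metadata
```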
[0041] Some embodiments may populate a “my recommended data” interface or similar default interface to facilitate discovery of data sets with a recommendation engine of module 316. Examples may surface data with which the user has previously interacted and which has subsequently been revised or commented upon, up-weighting fresher edits in the ranking. One such method may employ collaborative filtering techniques. These techniques might operate by examining the behavior or preferences of multiple users within the data marketplace or other form of data exchange. In certain implementations, the method may focus on identifying users that are similar to a target user based on their past interactions with data sets, such as data sets they have purchased, rated, or viewed. Once a set of similar users is identified, the system can aggregate data sets that those users have shown a preference for and subsequently recommend a subset of those aggregated data sets to the target user.
[0042] In some variations of collaborative filtering within the context of a data marketplace, the method may identify data sets that are similar to a given data set based on user interactions. This approach may be referred to as data set-based collaborative filtering. Once similar data sets are recognized, the system may aggregate users who have shown a preference for these
data sets, and then use this aggregated data to recommend a subset of these data sets to the target user.
[0043] Another approach within a data marketplace, in some embodiments, involves content-based filtering. Content-based filtering techniques may focus on the features or metadata of data sets and a profile of the user's preferences. In certain implementations, data sets that the user has shown interest in may have their features or metadata weighted more heavily in the user's profile, while those that the user has shown disinterest in may receive a negative weighting. The system may then compare the features or metadata of unacquired data sets to this weighted profile to generate a score for each data set. Data sets with the highest scores, indicating a greater alignment with the user's profile, may then be recommended to the user.
[0044] In some embodiments, hybrid methods may be used which combine aspects of both collaborative filtering and content-based filtering to suggest data sets within a data marketplace. One potential approach may involve separately running collaborative filtering and content-based models and subsequently combining their scores to generate a final recommendation score for each data set. The combination of scores may be achieved through various methods including but not limited to a weighted sum, averaging, or a more complex function that takes into account other factors.
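A weighted-sum combination of the two model scores, as described above, might be sketched as follows; the weighting and names are illustrative assumptions only:

```
def hybrid_score(cf_score: float, content_score: float, cf_weight: float = 0.6) -> float:
    # Weighted sum of collaborative-filtering and content-based scores.
    return cf_weight * cf_score + (1.0 - cf_weight) * content_score

def recommend(candidates, top_n=5):
    # candidates: iterable of (data_set_id, cf_score, content_score) tuples.
    ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2]), reverse=True)
    return [data_set_id for data_set_id, _, _ in ranked[:top_n]]
```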
[0045] Furthermore, matrix factorization techniques may be utilized in certain embodiments, especially where the user-data set interaction matrix within the marketplace is decomposed into multiple matrices representing latent factors. Predictions derived from the decomposition can help in suggesting data sets to users. Matrix factorization may involve approximating a user-item interaction matrix by decomposing it into multiple matrices, which may represent latent or hidden features. These latent features could provide a condensed representation and might capture patterns that are not immediately apparent in the initial matrix.
[0046] The user-item interaction matrix, in some instances, might be of size m x n, where m represents the total number of users and n represents the total number of items, such as data sets. The individual entries in this matrix may represent the interaction or preference of a specific user towards a specific item. However, it is to be noted that many of these entries could be missing or undefined, given that not every user may interact with every item.
[0047] To address the missing values and uncover underlying patterns, in some embodiments, the matrix may be factorized into two distinct matrices. One matrix, often referred to as P,
might be of size m x k and could represent the association or affinity between users and certain features. Similarly, a second matrix, which might be labeled as Q, could be of size n x k and may denote the association between items and the same set of features. The product of these matrices (or a suitable transformation of one, such as its transpose) might serve to approximate the original user-item interaction matrix.
[0048] The process of factorizing the original matrix into these two matrices might involve minimizing the difference or error between the product of these matrices and the known entries in the original matrix. This error, in some implementations, could be quantified using specific loss functions. One common approach could be to use a function that captures the Mean Squared Error (MSE) between known interactions in the original matrix and the corresponding values in the product of the factorized matrices. To ensure that the values in the matrices P and Q remain controlled and do not reach extremes, regularization terms might be integrated into the loss function in certain embodiments.
[0049] Various optimization methods might be employed to solve this minimization problem. Gradient descent is one potential approach that could be adopted. In other embodiments, stochastic gradient descent (SGD) or alternating least squares (ALS) might be preferred, given their specific advantages in certain scenarios. It is to be noted that while these methods are mentioned, other optimization techniques not expressly described herein might also be employed to achieve similar objectives.
[0050] Upon successful factorization, the latent features captured in the matrices may not inherently possess explicit meanings. However, they may encapsulate distinct patterns in user-item interactions. For instance, within a marketplace for books, these latent features might implicitly represent genres, themes, or author styles, even if such labels are not explicitly associated.
[0051] With the matrices P and Q derived, predictions for unknown or missing entries in the original matrix can be made by computing their product or an appropriate transformation thereof. These predictions, in turn, could facilitate the generation of personalized recommendations for users of data sets.
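For illustration, one possible sketch of such a factorization, using stochastic gradient descent on the regularized squared error over known entries as discussed above (assuming NumPy; the hyperparameter values are arbitrary choices, not prescriptions):

```
import numpy as np

def factorize(R, k=10, steps=200, lr=0.01, reg=0.02, seed=0):
    # Approximate an m x n interaction matrix R (np.nan marks missing entries)
    # by P @ Q.T, minimizing squared error on known entries plus an L2 penalty.
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = rng.normal(scale=0.1, size=(m, k))  # user-feature associations
    Q = rng.normal(scale=0.1, size=(n, k))  # item-feature associations
    known = np.argwhere(~np.isnan(R))
    for _ in range(steps):
        for u, i in known:
            err = R[u, i] - P[u] @ Q[i]
            pu = P[u].copy()  # keep the pre-update row for the Q gradient
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Predicted preferences, including for missing entries, are entries of P @ Q.T.
```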
[0052] Developers, in some embodiments, might introduce additional features to augment the matrix factorization process. For instance, biases associated with specific users or items could be incorporated into the factorization model to account for inherent tendencies. Additionally,
temporal dynamics, capturing how user preferences evolve over time, might be integrated into the model. Furthermore, external information, such as metadata about items or user demographics, could also be fused into the factorization process to provide more context-aware recommendations of data sets.
[0053] Deep learning techniques may be adopted within a data marketplace in some embodiments. Neural networks, possibly including architectures such as autoencoders, may be trained on user-data set interaction data. The latent representations captured during the training may help compute similarity scores between data sets, users, or both, to further enhance data set recommendations.
[0054] Contextual recommendations within a data marketplace, in some embodiments, may consider variables such as the time of day, user's recent search history, the specific sector or field of the user, or any other contextual data that may be relevant. This might allow for more tailored data set suggestions, such as recommending finance-related data sets during market hours or data sets related to a recent news event if the user has shown interest in that area.
[0055] Developers may also choose to integrate additional features into the recommendation process of the data marketplace. For example, algorithms can incorporate rules to ensure diverse data set recommendations or to avoid recommending data sets the user has recently viewed or acquired. In some scenarios, the recommendation engine might also consider factors like global research trends, emerging fields of study, or external news and events that could influence data set preferences.
[0056] Some embodiments may include data quality and metadata management module 318. This module 318, in some embodiments, may generate or allow input of metadata such as descriptive information for data sets. Some embodiments of module 318 may provide tools or processes to assess and rate the quality or reliability of data sets as well. In certain embodiments, the system may allow external input of custom metadata. The input metadata might contain details like description, source of the data set, creation date, or any other relevant information. The externally provided metadata may be integrated with the generated metadata.
[0057] In some embodiments, module 318 may itself or through input from users assess the quality of the data sets. The assessment may be based on various criteria. In one example, the quality is determined by the number of rows in the data set. A data set with more than a threshold number, say 1000 rows, may be deemed of high quality. In other embodiments,
quality assessment might be based on the completeness of the data, absence of null or missing values, consistency in data patterns, or any other relevant factor. Some embodiments may employ machine learning models or statistical techniques to determine data set quality. In certain embodiments, module 318 may allow for the retrieval of the generated or inputted metadata. This provides users or other systems with descriptive information about the data sets.
[0058] In some embodiments, once the data set's quality is assessed, a quality score may be generated and stored in association with the data set. This score might be on a scale, such as 1 to 5. Embodiments may retrieve and present this quality score, giving insights into the perceived value or reliability of the data set. Some embodiments may use this score for ranking data sets to return in response to a query, up-ranking data sets designated as being higher-quality.
[0059] In some embodiments, metadata might also include information like file size, data type distributions, unique values count, or any other relevant metric, and the quality assessment can be based on numerous other factors, including data freshness, source reliability, historical accuracy, or user feedback. In some cases, the quality score might be represented in different formats, such as stars, grades, or descriptive labels like “High Quality” or “Low Quality.”
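A deliberately simple sketch of a quality-scoring function on the 1-to-5 scale mentioned above, using the row-count threshold and completeness criteria as examples (the thresholds here are illustrative assumptions, not recommendations):

```
def quality_score(row_count: int, fraction_complete: float) -> int:
    # Map simple criteria onto a 1-to-5 quality scale.
    score = 1
    if row_count >= 1000:          # reward larger data sets
        score += 2
    if fraction_complete >= 0.95:  # reward few null or missing values
        score += 2
    return min(score, 5)
```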
[0060] Some embodiments include pricing and billing module 320. This module 320 in some embodiments allows data providers to set prices to access their data sets. Some embodiments handle transactions when a user wants to purchase access and manage billing, in some cases with subscription or tiered payment models supported. In some embodiments, data sets may be represented as instances of a data set class. Each data set may possess an identification number (id), a descriptive name (name), and an associated price (price). Additionally, users of the marketplace may be represented as instances of a user class, which may have an identification number (id), a name (name), and a balance (balance) indicating their available funds.
[0061] In some embodiments, a data marketplace hosted by server system 300 may include a collection or database 308 of data sets and users. This could be implemented using relational databases, dictionaries, arrays, linked lists, or any suitable data structure. The module 320 may include a method to set or modify the price of a data set. In some embodiments, the set_data_set_price method allows an authorized entity, possibly the data set owner or an admin, to specify a new price for a data set.
[0062] In some embodiments, a user may wish to purchase access to a data set. The module 320 may facilitate this transaction. This module 320 may check the user's balance against the data set's price, deduct the amount if sufficient funds are available, and then grant the user access to the data set. Users may have the ability to view their current balance. Additionally, they may be able to increase their balance through a top-up or deposit feature. The module 320 may allow a user to add funds to their account.
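The purchase flow described above, checking the user's balance against the data set's price, deducting the amount, and granting access, might be sketched as follows; the dictionaries here are hypothetical stand-ins for records in database 308:

```
class InsufficientFunds(Exception):
    pass

def purchase_data_set(user, data_set, access_grants):
    # Check the user's balance against the data set's price, deduct the amount
    # if sufficient funds are available, then grant access to the data set.
    if user["balance"] < data_set["price"]:
        raise InsufficientFunds(user["id"])
    user["balance"] -= data_set["price"]
    access_grants.setdefault(user["id"], set()).add(data_set["id"])

user = {"id": 1, "name": "Alice", "balance": 50.0}
data_set = {"id": 7, "name": "weather", "price": 20.0}
grants = {}
purchase_data_set(user, data_set, grants)  # balance becomes 30.0
```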
[0063] In some embodiments of module 320, a transaction history may be maintained for each user, allowing them to view their past purchases and top-up actions, and a promotional or discount mechanism may be integrated, allowing data set providers or the marketplace administrator to offer data sets at discounted prices for limited periods. Some embodiments of module 320 may implement tiered pricing structures, allowing users to choose from different access levels or data quality tiers at varying prices. Refund mechanisms may be provided by module 320, allowing users to request a refund within a specific window after purchase if they find the data set unsatisfactory. In some embodiments, module 320 may support subscription models, where users pay a regular fee for access to a collection of data sets or for enhanced features within the marketplace. Integration with external payment gateways or financial institutions may be incorporated by module 320 to facilitate transactions, manage user balances, and handle currency conversions if the marketplace operates across multiple countries.
[0064] Some embodiments include analysis tools integration module 322. Module 322 may allow users to perform basic or complex analytics on data sets, in some cases integrating with third-party tools. Some embodiments may include visualization tools for graphical representation of data or transformations thereon. Module 322, in certain embodiments, is equipped with an endpoint to accept data uploads. For instance, this endpoint may be designed to accept files in a CSV format, though in other embodiments, other formats like JSON, Excel™, Parquet, or database dumps may be accommodated. Once data is uploaded, it may be stored temporarily in memory for quick processing, although other storage mechanisms like databases, file storage, or cloud storage solutions can also be employed.
[0065] In some embodiments of module 322, after data has been uploaded, a user may request statistics about the data set. The statistics generated might include measures of central tendency like mean, mode, or median; measures of spread like standard deviation or variance; and counts
of data points, among others. In some embodiments, other statistics or data summaries could be computed, such as skewness, kurtosis, or custom-defined metrics.
[0066] To further enhance user insight, the module 322, in certain embodiments, provides visualization capabilities. As an example, a user might request a histogram of a particular data column, but bar charts, line charts, scatter plots, or more complex visualizations may also be provided. In some embodiments, matplotlib is employed to generate these visualizations, though other visualization libraries or tools such as Seaborn™, Plotly™, D3.js™, or even integrated services like Tableau™ or PowerBI™ could also be used. The visualizations, once generated, may be converted into a suitable format for web transmission. In one embodiment, a PNG (portable network graphics) image format is employed. However, other formats like JPEG (Joint Photographic Experts Group), GIF (Graphics Interchange Format), SVG (Scalable Vector Graphics), or even interactive web formats like HTML (hypertext markup language) or JavaScript™-based visualizations can be used.
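By way of example, summary statistics and a histogram rendered to PNG bytes for web transmission, as described above, might be sketched with the Python standard library and matplotlib; the function names are hypothetical:

```
import io
import statistics

import matplotlib
matplotlib.use("Agg")  # headless rendering suitable for a server environment
import matplotlib.pyplot as plt

def describe(values):
    # Basic measures of central tendency and spread (assumes two or more values).
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def histogram_png(values, title="histogram") -> bytes:
    # Render a histogram of one data column and return it as PNG bytes.
    fig, ax = plt.subplots()
    ax.hist(values, bins=20)
    ax.set_title(title)
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
    plt.close(fig)
    return buffer.getvalue()
```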
[0067] Moreover, in some embodiments, other forms of analytical processing might be offered. This could include integration with machine learning libraries like TensorFlow™, Scikit-learn™, Code Interpreter features from OpenAI™, or the like, allowing users to perform tasks like clustering, regression, or classification on the uploaded data. Additionally, the system may interface with tools like Jupyter™ notebooks, allowing users to conduct more bespoke, in-depth analyses.
[0068] In certain embodiments, error handling and data validation are incorporated to enhance system robustness. This might involve checking the uploaded data for consistency, missing values, or potential anomalies. The module 322 could provide feedback or even suggestions for data cleaning and preprocessing.
[0069] Furthermore, in some embodiments, to enhance user collaboration and insight sharing, features like saving visualizations, sharing results with other users, or even publishing findings to a wider audience may be incorporated. Additionally, APIs and integration endpoints may be provided, allowing third-party software or services to interact with the analysis tool, offering automation capabilities, integration with other platforms, or extending the module 322’s functionalities.
[0070] Some embodiments of data security and encryption module 324 may use encryption techniques to safeguard sensitive data, ensuring confidentiality, integrity, and availability of
data. In some embodiments, the module 324 utilizes symmetric encryption for securing data at rest. A key, which may be derived from a password or other secret, may be used to both encrypt and decrypt the data. The generation of this symmetric key may involve the use of a salt, which may include a random sequence of bytes. This salt, when combined with the password, can be input to a Key Derivation Function (KDF) to produce the symmetric key. One possible KDF that may be used is PBKDF2HMAC, although other KDFs such as Argon2, scrypt, or bcrypt might also be used depending on system requirements. In some embodiments, the salt is generated using a cryptographic random number generator. The generated salt may be stored separately from the encrypted data, and in some implementations, it might be stored alongside the encrypted data, typically as a prefix to the ciphertext. In certain embodiments, the password used for deriving the symmetric key is not hard-coded within the system but is obtained from secure external sources or inputs, ensuring dynamicity and enhanced security. This password might be provided by an end user, retrieved from secure environment variables, or fetched from secure key management systems.
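A sketch of this symmetric path, deriving a key from a password and salt with PBKDF2HMAC and encrypting with the cryptography library's Fernet recipe, is shown below; the iteration count and the salt-as-prefix layout are illustrative assumptions:

```
import base64
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def _derive_key(password: bytes, salt: bytes) -> bytes:
    # PBKDF2HMAC turns the password and salt into a 32-byte Fernet key.
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                     salt=salt, iterations=480_000)
    return base64.urlsafe_b64encode(kdf.derive(password))

def encrypt_at_rest(plaintext: bytes, password: bytes) -> bytes:
    salt = os.urandom(16)  # random salt from a cryptographic RNG
    # Store the salt alongside the ciphertext, here as a prefix.
    return salt + Fernet(_derive_key(password, salt)).encrypt(plaintext)

def decrypt_at_rest(blob: bytes, password: bytes) -> bytes:
    salt, ciphertext = blob[:16], blob[16:]
    return Fernet(_derive_key(password, salt)).decrypt(ciphertext)
```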
[0071] For securing data in transit, asymmetric encryption methods might be employed in some embodiments of module 324. Asymmetric encryption, such as public-key cryptography, may involve the use of a pair of keys: a private key, which remains confidential, and a public key, which may be shared openly. Data encrypted with the public key can only be decrypted with the corresponding private key, and vice versa. In some embodiments, the RSA encryption algorithm is utilized for asymmetric encryption, although other algorithms like Elliptic Curve Cryptography (ECC) or Diffie-Hellman might also be considered. The RSA key pair can be generated with varying key sizes, depending on the desired security level. While a 2048-bit key size might be used in many scenarios, larger key sizes like 3072-bit or 4096-bit may be selected for heightened security environments.
[0072] Data encrypted for transit might employ specific padding schemes. In some embodiments, the OAEP (Optimal Asymmetric Encryption Padding) is used alongside a mask generation function such as MGF1 (mask generation function) and a hash function like SHA256. However, other padding schemes and hash functions may be employed based on specific requirements. In some embodiments, the module 324 might provide functionalities to serialize the generated RSA keys into standard formats, such as PEM, for storage or transmission purposes. The private key, due to its sensitive nature, may be stored securely using hardware security modules, encrypted filesystems, or other secure storage mechanisms.
Additionally, in some embodiments, the module 324 may include features to enhance the encryption process. Features like data compression prior to encryption, chunking large data sets into manageable blocks for efficient encryption, or adding data integrity checks using cryptographic signatures or Message Authentication Codes (MACs) might be implemented. Moreover, the system might offer ways to rotate keys regularly, ensuring that even if a key is compromised, its window of vulnerability is limited. In some embodiments, robust logging and alerting mechanisms may be integrated to notify system administrators or users of any encryption-related anomalies.
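The asymmetric path might be sketched with the same cryptography library, generating a 2048-bit RSA key pair, applying OAEP padding with MGF1 and SHA-256 as described above, and serializing the public key to PEM; this is an illustrative arrangement only:

```
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# OAEP padding with an MGF1 mask generation function and SHA-256.
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = public_key.encrypt(b"data in transit", oaep)
recovered = private_key.decrypt(ciphertext, oaep)

# Serialize the public key to PEM for storage or transmission.
pem = public_key.public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo)
```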
[0073] Some embodiments include audit and compliance module 326, which in some cases may keep track of who accesses what data and when, for instance with access logs in the database 308. Some embodiments may assist in ensuring that the marketplace adheres to relevant regulations and standards like General Data Protection Regulation (GDPR) or Health Insurance Portability and Accountability Act (HIPAA). The system may include audit records and a scheduled process responsible for interfacing with the underlying database 308 and performing compliance checks.
[0074] The audit record, in some embodiments, may comprise a user identifier, an action, a data set identifier, and a timestamp. The user identifier can be any form of unique identification associated with a user in the data marketplace. The action may be a representation of any action performed by the user, such as accessing a data set, modifying a data set, or any other relevant activities. The data set identifier can be an identification number or reference associated with a specific data set in the marketplace. The timestamp, in some embodiments, may capture the exact time when the action was performed. In some embodiments, the audit record can also include other attributes such as an IP (internet protocol) address from which the action was performed, a device identifier, and potentially other meta information that provides context to the action. A compliance manager process may, in some embodiments, interface with database 308 to store and retrieve audit records. The compliance manager, in some embodiments, may be responsible for creating database tables to store the audit records. When adding a new record, the compliance manager process may save the user identifier, action, data set identifier, and the timestamp to the database. In other embodiments, batch processing can be implemented where multiple audit records are stored in the database at once to optimize database transactions.
[0075] In some embodiments, the module 326 may allow for retrieval of audit records. This retrieval can be general or filtered based on certain attributes like a specific user identifier. The retrieval functionality can be extended to include more complex query operations, leveraging the capabilities of the underlying database system. A compliance check function may also be provided. In some embodiments, this function checks the actions of a user against certain compliance criteria. For instance, it may ensure that a user has not exceeded the number of accesses for a particular data set. The criteria for compliance checks can be highly configurable, and in other embodiments, they may involve checking against regional data access regulations, adherence to user or data set-specific access controls, or any other relevant checks.
[0076] Some embodiments of module 326 may include the following: alerts or notifications when non-compliance is detected; integration with third-party compliance monitoring tools or services; machine learning or statistical analysis methods to detect anomalous behavior or potential breaches; backup and restore functionalities for the audit records; real-time (e.g., with less than 2 minutes of latency) monitoring and reporting dashboards; user interfaces or API endpoints for administrators or auditors to interact with the system; filtering and search capabilities on the audit records, leveraging techniques such as full-text search or relational queries. Additionally, in some embodiments, the module 326 may also provide for data anonymization, ensuring that audit records themselves comply with data protection regulations. Encryption mechanisms may be employed to further protect the integrity and confidentiality of the audit records.
[0077] Some embodiments may include feedback and ratings module 328. This module may allow users to rate and review data sets, associating those forms of feedback with the data sets in database 308. Some embodiments may help users assess a data set's value and quality. Some embodiments may allow data providers to receive feedback to improve their offerings as well. The module 328 may be configured to allow users to submit feedback about data sets, comprising both a numerical rating and a textual comment. Furthermore, the system may be designed to retrieve and display such feedback to other users to aid in their evaluation of data sets.
[0078] In some embodiments, the module 328 includes a portion of database 308 configured to store feedback entries. Each feedback entry may include an identifier for the data set, an identifier for the user submitting the feedback, a numerical rating, and a textual comment. In some embodiments, the system may provide an HTTP endpoint configured to receive POST
requests for submitting feedback. When a POST request is received at this endpoint, the system may extract details of the feedback from the request, create a new feedback entry in the database, and respond with a confirmation. The details extracted from the request may include, but are not limited to, the data set identifier, user identifier, rating, and comment. Additionally, in some embodiments, the system may provide an HTTP endpoint configured to handle GET requests to retrieve feedback for a specific data set. Upon receiving a GET request at this endpoint, the system might query the database for all feedback entries associated with the specified data set and return the results.
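A sketch of such endpoints follows, using the Flask framework with an in-memory list standing in for the feedback portion of database 308; the route paths and field names are illustrative assumptions:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
feedback_store = []  # stand-in for the feedback portion of database 308

@app.route("/feedback", methods=["POST"])
def submit_feedback():
    # Extract the data set identifier, user identifier, rating, and comment.
    entry = {
        "data_set_id": request.json["data_set_id"],
        "user_id": request.json["user_id"],
        "rating": int(request.json["rating"]),
        "comment": request.json.get("comment", ""),
    }
    feedback_store.append(entry)
    return jsonify({"status": "received"}), 201

@app.route("/feedback/<data_set_id>", methods=["GET"])
def get_feedback(data_set_id):
    # Return all feedback entries associated with the specified data set.
    return jsonify([f for f in feedback_store if f["data_set_id"] == data_set_id])
```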
[0079] For enhancing the usability and robustness of the system, in some embodiments, user authentication mechanisms may be integrated, e.g., with module 310 above. Before submitting or retrieving feedback, users might be required to authenticate themselves. The authentication can be based on various methods, such as username/password combinations, OAuth integration, multi-factor authentication, or other authentication methods. To ensure the integrity and reliability of the feedback, in some embodiments, the system may include a mechanism for verifying the authenticity of feedback. For instance, after submitting feedback, a user might receive an email prompting them to confirm their submission. Only after receiving such confirmation might the feedback be displayed to other users. In certain embodiments, the system might offer features for users to edit or delete their previously submitted feedback. An HTTP endpoint for handling PUT or DELETE requests could be provided for this purpose. To provide users with a summarized view of feedback for a data set, in some embodiments, module 328 may compute and display aggregate metrics, such as the average rating, the number of feedback entries, or the distribution of ratings. In some embodiments, the feedback system might be augmented with a feature allowing users to reply to or comment on feedback entries, enabling a threaded discussion about a data set's quality, relevance, or other aspects. Furthermore, in certain embodiments, to improve the relevance of feedback, the system may employ machine learning or other algorithmic methods to analyze and filter feedback, highlighting particularly helpful or insightful entries and perhaps deprioritizing or hiding feedback deemed to be of lesser relevance or quality. In addition, in some embodiments, feedback entries may be tagged with categories, topics, sentiment, or other labels, indicating specific attributes or characteristics of the feedback, such as “positive,” “negative,” “technical issue,” or “data inconsistency.”
[0080] Some embodiments include notification system module 330. This module 330 may establish a table or other collection of notification records, which may contain multiple fields or columns such as id, user_id, message, and status. The id may serve as a unique identifier for each notification, ensuring individual addressing. The user_id may be used to identify the intended recipient of the notification. The message field may store the actual content or body of the notification, while the status field can indicate whether the notification has been read or remains unread by the user. In some embodiments, when a new notification is to be generated, the system receives a POST HTTP request. This request may include data such as the user's identifier and the content of the message to be delivered. Upon receiving this request, the system can verify the data, ensuring the presence of necessary information. Following verification, the system may insert the new notification into the database, assigning it a unique identifier and marking its initial status as 'unread'. Users or recipients can retrieve their notifications by sending a GET HTTP request via device 307, specifying their unique user identifier. In response, the module 330 may query the database 308, extracting all relevant notifications associated with that user. These notifications can be sent back to the requester in a structured format, such as JSON. In some embodiments, users have the capability to mark a specific notification as 'read'. This may be accomplished by sending a PUT HTTP request to the system, specifying the unique identifier of the notification in question. Upon receipt of such a request, the system updates the status of the specified notification in the database 308.
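The notification records and status transitions described above might be realized as follows; this sketch again uses SQLite as a stand-in for database 308, and the column names mirror the fields listed above:

```python
import sqlite3
import uuid

conn = sqlite3.connect("notifications.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS notifications (
           id TEXT PRIMARY KEY, user_id TEXT, message TEXT, status TEXT)"""
)

def create_notification(user_id, message):
    # New notifications get a unique identifier and start out 'unread'.
    note_id = str(uuid.uuid4())
    conn.execute("INSERT INTO notifications VALUES (?, ?, ?, 'unread')",
                 (note_id, user_id, message))
    conn.commit()
    return note_id

def get_notifications(user_id):
    # Corresponds to the GET request: fetch all notifications for a user.
    rows = conn.execute(
        "SELECT id, message, status FROM notifications WHERE user_id = ?",
        (user_id,)).fetchall()
    return [{"id": r[0], "message": r[1], "status": r[2]} for r in rows]

def mark_read(note_id):
    # Corresponds to the PUT request: flip the status to 'read'.
    conn.execute("UPDATE notifications SET status = 'read' WHERE id = ?",
                 (note_id,))
    conn.commit()
```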
[0081] Some embodiments of module 330 may include the following features: authentication and authorization mechanisms may be incorporated to ensure that users can only access and modify their own notifications; in some embodiments, broadcast notifications could be implemented, allowing a single message to be sent to multiple or all users; module 330 might support categorization or tagging of notifications, enabling users to filter and sort their messages based on topics or urgency; extending beyond text, notifications could support rich media, such as images, audio, video, or interactive content; to optimize user engagement, module 330 may incorporate machine learning algorithms or heuristics to determine the best time to deliver notifications; in some embodiments, module 330 could offer integration points with other platforms or services through APIs, allowing automated generation, delivery, or processing of notifications; and features like batch processing, where multiple notifications are processed as a group, may be added to enhance module 330 efficiency.
[0082] Some embodiments include APIs and integration endpoints module 332. APIs may facilitate machine-to-machine exchanges. Examples might expose an endpoint that manages data sets in database 308. When a GET request is made, the module 332 may retrieve a list of all data sets. If a specific data set is to be retrieved, the system could accept a unique data set identifier appended to the endpoint, e.g., '/api/data_sets/<data_set_id>'. In certain implementations, the module 332 may connect to database 308 to fetch the data set details. For the creation of a new data set, a POST request might be sent, e.g., to '/api/data_sets'. The request may include data set details in the body, e.g., in JSON format, although other formats like XML could also be supported. Upon receipt, the module 332 might save the data to a persistent storage system in database 308. Depending on system design and requirements, other operations, like data validation, transformation, or enrichment, may be applied before saving.
[0083] Updating an existing data set may be achieved by a PUT request to the specific data set endpoint, for example, '/api/data_sets/<data_set_id>'. The request may carry the updated data set details, and on receipt, the system could modify the existing data set in its storage. Deleting a data set might be effectuated by a DELETE request to the data set's unique endpoint. In response, the system may remove the data set from its storage system, or perhaps mark it as inactive, depending on the data retention policies in place. Other examples of features that may be incorporated in module 332 include the following: user management endpoints could be introduced to allow for the creation, modification, or deletion of user profiles; endpoints for feedback and ratings may be developed, where users can submit feedback for data sets or rate them based on their quality and relevance; a notification system might be facilitated through specific endpoints, where external services can push notifications to be delivered to users, or users can fetch any notifications relevant to them; search endpoints can be implemented, allowing users or third-party services to search for data sets based on various criteria, like keywords, tags, or categories; pagination and filtering features may be added to list-based endpoints to allow users to retrieve a subset of results based on specific criteria or to navigate large result sets. A compact sketch of the core data set endpoints appears below.
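The CRUD operations described in paragraphs [0082] and [0083] might look as follows in Flask, with an in-memory dictionary standing in for persistent storage; the routes mirror the example paths above:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
data_sets = {}  # in-memory stand-in for persistent storage in database 308

@app.route("/api/data_sets", methods=["GET"])
def list_data_sets():
    return jsonify(list(data_sets.keys()))

@app.route("/api/data_sets", methods=["POST"])
def create_data_set():
    details = request.json  # data set details carried in the request body
    data_sets[details["id"]] = details
    return jsonify({"status": "created"}), 201

@app.route("/api/data_sets/<data_set_id>", methods=["GET", "PUT", "DELETE"])
def data_set_detail(data_set_id):
    if request.method == "GET":
        return jsonify(data_sets.get(data_set_id))
    if request.method == "PUT":
        data_sets[data_set_id] = request.json  # replace with updated details
        return jsonify({"status": "updated"})
    data_sets.pop(data_set_id, None)  # or mark inactive per retention policy
    return jsonify({"status": "deleted"})
```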
[0084] Some embodiments include admin and moderation tools module 334, which may be used by marketplace staff to manage users, data sets, and system health. It may include tools for handling disputes, verifying data set quality, or removing inappropriate content. The module 334 may include a component that provides an interface, which in some embodiments could be a web-based dashboard, for administrators to view all data sets submitted by users.
This interface may present data sets in a list, grid, or other formats and may provide search, filter, or sorting capabilities to help administrators quickly locate specific data sets. In certain embodiments, the module 334 offers a verification mechanism. Through this mechanism, administrators can approve or verify data sets to vouch for their credibility or authenticity. Some embodiments may use a binary system (verified or not verified), or other embodiments might include multiple levels of verification, labels, or badges to indicate varying levels of trustworthiness or quality.
[0085] The module 334 may also include functionality for marking data sets as containing inappropriate content. This feature may allow administrators to flag or hide data sets that do not comply with the platform's policies or standards. In some embodiments, once a data set is marked as inappropriate, it may be hidden from user view, or a warning could be displayed. In other embodiments, the data set might be entirely removed from the system. To further enhance the moderation process, the module 334 may offer a dispute handling mechanism. In the event of disagreements or conflicts between data providers and consumers, administrators can intervene and make decisions. Some embodiments may provide more sophisticated workflows, such as recording reasons for decisions, sending out notifications to involved parties, or integrating with third-party mediation services. Dispute records may be associated with data sets, feedback, or the like and viewable by users in some cases. In some embodiments, these records and workflows could be extended to handle user reports, reviews, or feedback, allowing administrators to address concerns, resolve issues, or gather insights for platform improvements. Or some embodiments may support community-based moderation, where users or forum administrators participate in these workflows.
[0086] Some embodiments of module 334 may include the following features: In some embodiments, administrators may have the capability to select multiple data sets and perform bulk actions, such as verifying, flagging, or deleting them in one go; The system might record administrator actions, such as changes made, data sets viewed, or disputes handled, offering transparency and traceability; In certain embodiments, real-time notifications could alert administrators to urgent issues, new data set submissions, or significant platform events; An analytics dashboard could provide administrators insights into user behavior, popular data sets, and other platform metrics; The system may include tools to collect feedback from users on flagged data sets, offering administrators a more comprehensive view of potential issues.
[0087] Some embodiments include infrastructure management module 336. This module 336 may manage server health, scalability, and resource allocation. Some embodiments may include auto-scaling capabilities, backup systems, and disaster recovery failover routines. In some embodiments, a method may involve regularly monitoring the health or performance metrics of one or more servers that are part of an online platform's infrastructure. These servers may be physical or virtual machines, and they might be located in a single data center or distributed across multiple geographic locations. One way to ascertain the health or performance metrics of a server is by sending HTTP requests to an API endpoint provided by a cloud service provider or an infrastructure management platform. In some embodiments, the module 336 may communicate with services such as AWS™, GCP™, Azure™, or any other cloud or on-premise infrastructure provider to retrieve these metrics.
[0088] In some embodiments, when a certain metric, such as CPU (central processing unit) or memory usage, exceeds a predetermined (or dynamically determined) threshold, the module 336 may take an action to adjust the resources allocated to that server. This action may involve increasing the computational power, memory, or storage of the server, or replicating that server and load-balancing between instances. For instance, if the CPU usage of a server exceeds 80%, the system may choose to scale up the resources for that server. On the other hand, if the CPU usage drops below a certain threshold, say 20%, the system may scale down the resources or terminate an instance of a server. The exact thresholds can vary based on implementation and requirements. Apart from CPU usage, other metrics that the module 336 may consider in some embodiments include RAM (random access memory) usage, network throughput, disk I/O (input/output), number of concurrent users, and database query response times, among others. In some instances, a combination of metrics may be used to make scaling decisions. In some embodiments, machine learning algorithms or predictive analytics models of module 336 may be employed to predict future resource demands and adjust resources proactively. In some embodiments, the module 336 may also notify system administrators or other stakeholders when certain conditions are met or when specific actions are taken. Such notifications may be delivered through email, SMS (short message service), application dashboards, or other communication channels. Additionally, to ensure optimal performance and responsiveness, in some embodiments, the module 336 may employ load balancers to distribute incoming traffic among multiple servers. The system may adjust the rules or configurations of these load balancers based on the monitored metrics.
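The threshold-based scaling decision described above can be sketched as follows; the thresholds and metric names are illustrative and would be tuned per deployment:

```python
SCALE_UP_THRESHOLD = 80.0    # illustrative; e.g., scale up above 80% usage
SCALE_DOWN_THRESHOLD = 20.0  # illustrative; e.g., scale down below 20% usage

def scaling_decision(metrics):
    """Return 'scale_up', 'scale_down', or 'hold' from monitored metrics.

    `metrics` is assumed to be a dict of readings, such as
    {'cpu_percent': 85.0, 'memory_percent': 60.0}, pulled from a cloud
    provider's monitoring API.
    """
    cpu = metrics.get("cpu_percent", 0.0)
    memory = metrics.get("memory_percent", 0.0)
    # A combination of metrics may be used; here either CPU or memory
    # pressure triggers a scale-up, and both must be idle to scale down.
    if cpu > SCALE_UP_THRESHOLD or memory > SCALE_UP_THRESHOLD:
        return "scale_up"
    if cpu < SCALE_DOWN_THRESHOLD and memory < SCALE_DOWN_THRESHOLD:
        return "scale_down"
    return "hold"
```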
[0089] Some embodiments of module 336 may include other features, such as the following: A failover mechanism that redirects traffic or operations from a failing server to a healthy one; Integration with third-party monitoring tools, such as Datadog™, New Relic™, or Grafana™, which might offer deeper insights or visualizations of the performance data; A logging mechanism that records all scaling actions, reasons for these actions, and the state of the system before and after the action; A backup system that periodically creates backups of data and configurations, ensuring data integrity and allowing for recovery in case of failures; A disaster recovery plan that defines steps to restore normal operations in catastrophic scenarios, such as data center outages; and the module 336, in some embodiments, may provide an interface for administrators to manually override scaling decisions or to set specific scaling policies based on expected events, such as planned maintenance or anticipated traffic spikes due to promotions or events.
[0090] Figure 2 is a flow chart showing an example of a process 500 by which data may be discovered. In some embodiments, the process 500 may be performed by the module 316 described above, or in some cases the process may be performed by other components. The steps of the process 500, like the other processes described herein, may be performed in a different order from that illustrated: additional steps may be inserted; some steps may be repeated more times than others in a given iteration; and some steps may be performed concurrently with one another, or in some cases the steps may be performed serially, none of which is to suggest that any other description herein is limiting. The process, like the other processes herein, may be implemented by computer-readable instructions stored on a tangible, non-transitory medium, such that when the instructions are executed by a computer system, the described functionality is effectuated.
[0091] In some embodiments, the process 500 may include obtaining a plurality of data sets from a plurality of different users of a data exchange as indicated by block 510. Some users (which may be human or other computer processes) may only write data sets, some may only read data sets, and some may do both. In some embodiments, the data sets may also be referred to as data assets. In some cases, the data sets may be obtained by different users uploading their data to the data exchange over time to share that data with other users of the data exchange. In some cases, the data sets may be obtained from third-party computer systems that push data or from which data is pulled to constitute the data sets, examples including census data hosted by
the US Government, computer logs, measurement logs of industrial process control systems, results of clinical studies, business data, and the like.
[0092] In some cases, the data exchange is the above-described server system 300 or the data exchange may take other forms. In some embodiments, the data exchange provides a forum in which data sets can be shared, revised, commented upon, transformed, explored, analyzed, and otherwise used by a community of users. In some cases, the users are from the same organization or in some cases the users are part of a broader community. In some cases, access to different data sets may be limited by roles and permissions as described above.
[0093] Data sets may each take a variety of different forms. Examples include tabular data, unstructured data, natural language text, semi-structured data, binary blobs, and the like. In some embodiments, the data sets have a diversity of different schemas. In some cases, the data sets take the form of relational database exports having a plurality of tables with key values by which different tables may be joined, for example, in third normal form. In some cases, the data sets are semi-structured or unstructured documents such as JSON or XML files. In some cases, the data sets are comma separated value (CSV) files. In some cases, the data sets are key-value store exports, such as dictionaries.
[0094] In some embodiments, the data sets include a plurality of tables in each of at least some of the data sets, such as more than 5 or more than 10 tables in some subset of the data sets. In some cases, some of the data sets may be relatively large, for example, larger than 500 megabytes, 1 terabyte, or 2 terabytes. The data sets may include both numeric and alphanumeric data in some cases. In some cases, data in the data set is human-interpretable without passing through a codec, e.g., an image file itself typically would not be a data set, though data sets may include images, audio, video, or the like.
[0095] Data sets may be associated with various forms of metadata, including a version of the data set, a date the data set was created, an identifier of a provider of the data set, a natural language prose description of the data set, various applicable tags, characteristics of the data set in an ontology of data set attributes, and social data relevant to the data set accumulated through processes like those described below with reference to Figure 3. The data sets may be obtained over time asynchronously relative to the other steps of the process 500 (which is not to suggest other steps must be performed synchronously). In some embodiments, more than 10, more than
100, or more than 10,000 different data sets are obtained over more than a day, month, or year, from more than 10, 100, or 1,000 different users.
[0096] Some embodiments of the process 500 include receiving a query to search among the plurality of data sets as indicated by block 512. In some cases, the query is a natural-language text query submitted via a text box interface of a web page provided by the server system 300 to a client computing device 307 as discussed above with reference to Figure 1. In some cases, the query is a Boolean search query. In some embodiments, the query is generated by a process executed by the system 300, for example, a process by which a dashboard or personalized user interface, like a web page, is prepared for a user. In some cases, a recommendation engine executed by module 316 may periodically, or in response to new data, execute an update of a personalized list, such as a ranked list of data sets for a user based on the user's profile. In some cases, this process may include submitting one or more queries for data sets having attributes that a recommendation engine indicates would be relevant or otherwise helpful to the user based on their profile, for instance, based upon collaborative filtering over past interactions with the data exchange by the user and other users.
[0097] In some cases, the query may include a plurality of query parameters, such as one or more tokens of natural language text, such as tokens in n-grams of natural language text. In some cases, some or all of these n-grams may be enriched by supplementing the query with synonyms of those n-grams, such as synonyms of individual words. To this end, some embodiments may maintain in memory a predetermined dictionary of n-grams (or more generally, tokens) and their synonyms. In some cases, these synonyms are one-to-many mappings of individual words to a respective set of synonyms, or in some cases, they are listings of corresponding longer phrases with similar or the same meaning. Some embodiments may also remove n-grams from the query, such as members of a list of “stop words,” like those having a TF-IDF or BM25 score relative to a corpus (like the plurality of data sets obtained) less than a threshold value. Words such as “the,” “an,” “a,” and the like may add little value to some queries because they can be so common in the corpus of data sets being searched. Search may be based on a variety of approaches, such as distributional semantics approaches. In some cases, search may be implemented with a bag-of-words model, latent semantic analysis, vector search in semantic embedding spaces, etc.
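Query enrichment and stop-word removal of this kind might be sketched as follows; the synonym dictionary and stop-word list here are tiny illustrative stand-ins for the predetermined dictionary and the TF-IDF- or BM25-derived list described above:

```python
SYNONYMS = {"car": {"automobile", "vehicle"}}   # illustrative dictionary
STOP_WORDS = {"the", "an", "a", "of", "and"}    # or derived from TF-IDF/BM25 scores

def expand_query(text):
    # Tokenize, drop stop words, then supplement surviving tokens with
    # synonyms from the predetermined dictionary.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    expanded = set(tokens)
    for token in tokens:
        expanded |= SYNONYMS.get(token, set())
    return expanded

# expand_query("the car registrations")
# -> {"car", "automobile", "vehicle", "registrations"}
```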
[0098] Some embodiments of the process 500 include determining a group of data sets among the plurality of data sets that is responsive to the query, as indicated by block 514. In some
cases, the group may include a plurality of data sets that is smaller than the total number of data sets maintained in the data exchange and obtained in block 510. In some cases, the group is less than 1% of those data sets, for example. Determining the group of data sets may include searching the data exchange for data sets including one or more of the query terms, such as the n-grams or synonyms discussed above. In some cases, the search may be across a plurality of tables in each of some or all of the data sets in the exchange. In some embodiments, the search includes searching data set titles, field names (such as column or row headers), and data set values. Some embodiments may also search the various forms of commentary associated with data sets through user interaction, for example, like those described below with reference to Figure 3.
[0099] Different forms of search may be used. In some cases, the search may be based upon the number of times terms in the query appear within the respective data sets, e.g., normalized by data set, table, or field size. In some cases, the search may be based upon distance between vectors characterizing the query and the data sets, or parts thereof like field names, values, or commentary, in an embedding space, with data sets whose vectors fall within a threshold distance of the query's embedding vector being deemed responsive and added to the group.
[0100] In some embodiments, computer-based information retrieval mechanisms may utilize binary representations where data sets (or parts thereof) and queries are represented as binary vectors within a term space. Such representations can be advantageous for certain types of data or specific query scenarios. In these embodiments, queries may be evaluated using logical operations like AND, OR, and NOT to filter and retrieve relevant data sets from a collection. For instance, a system may iterate over a set of data sets, evaluating whether each data set (or part thereof) matches the criteria specified in the query.
[0101] In some embodiments, an algebraic model may be employed, where both data sets (or parts thereof) and queries are represented as vectors within a term space, like an embedding space. Here, the degree of similarity or relevance between a query and a data set may be computed based on the cosine of the angle between their respective vectors, e.g., with Word2Vec or the like. The computation might involve determining the dot product of the two vectors and dividing by the product of their magnitudes. Such a method may provide a measure of similarity that can rank data sets in order of relevance to the query.
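The cosine computation described in this paragraph reduces to a few lines; this sketch assumes embedding vectors (e.g., from Word2Vec or the like) are already available as NumPy arrays:

```python
import numpy as np

def cosine_similarity(query_vec, data_set_vec):
    # Dot product of the two vectors divided by the product of their magnitudes.
    denom = np.linalg.norm(query_vec) * np.linalg.norm(data_set_vec)
    return float(np.dot(query_vec, data_set_vec) / denom) if denom else 0.0

def rank_by_similarity(query_vec, data_set_vecs):
    # Rank data set identifiers by similarity of their vectors to the query's.
    scores = {ds_id: cosine_similarity(query_vec, vec)
              for ds_id, vec in data_set_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```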
[0102] For some information retrieval scenarios, probabilistic models may be applied. In these models, data sets can be ranked based on the computed probability that a particular data set (or part thereof) is relevant to a query. This computation might draw from various probabilistic principles and formulas, depending on the specifics of the model chosen. For instance, some models might use Bayes' theorem to calculate the likelihood of relevance.
[0103] In other embodiments, feature-based retrieval models may be employed. Here, machine learning techniques can be used to learn a ranking function from feature vectors derived from the data sets and queries. Such models may extract a set of features from both the query and the data set (or parts thereof), forming a feature vector. A trained machine learning model might then predict a score for the data set based on this feature vector. The score may indicate the data set's relevance to the query, allowing for a ranked list of data sets. To further enhance the effectiveness of such models, developers might choose to add various features to the feature vectors, such as term frequency, data set length, or even external information like user behavior with the data set.
[0104] Latent patterns in data may also be explored for retrieval. In some embodiments, techniques similar to Latent Semantic Indexing (LSI) might be utilized. Such methods may involve operations analogous to singular value decomposition on term-document (e.g., by data set, table, row, column, etc.) matrices to discern patterns and potentially reduce dimensionality, thereby capturing implicit associations between terms and data sets.
[0105] Furthermore, in some embodiments, neural network-based models may be used. These models can capture semantic meanings and relationships in data. Models similar to deep learning architectures, such as transformers or recurrent networks, might be implemented to understand and retrieve data sets based on the context and deeper semantic relationships present in the data.
[0106] Feedback mechanisms might be integrated into some retrieval systems. In some cases, the system may refine searches based on user feedback regarding the relevance of retrieved data sets. In other scenarios, the system might assume that the top-ranked data sets from an initial search are relevant and use this assumption to refine subsequent searches, a method similar to pseudo-relevance feedback.
[0107] Additionally, some embodiments might incorporate features like clustering (e.g., DBSCAN, k-means, or the like), where data sets are grouped based on similarities, or
classification, where data sets are categorized into predefined classes. In some cases, clustering is implemented with Latent Dirichlet Allocation (LDA), by topic. In some embodiments, multiple aforementioned techniques and models may be combined or modified to achieve desired retrieval performance and characteristics. Different combinations and adaptations might be chosen based on the nature of the application, the kind of data being processed, and specific retrieval requirements.
[0108] In some embodiments to expedite searches, data sets may be indexed. In some cases, the data sets are indexed by n-gram, for example, in a table or prefix tree in which query terms are associated with data sets or portions thereof. For example, a prefix tree may have leaf nodes that point to individual entries, fields, tables, or data sets, and higher-level branches of the prefix tree may correspond to different parts of the query. In another example, the data sets may be indexed by the above-described embedding vectors, with embedding vectors corresponding to query terms being associated with data sets or portions thereof within a threshold distance of those embedding vectors to facilitate relatively fast retrieval. Indexing may be performed before receiving the query. For example, periodically or in response to receiving a new data set or new version thereof.
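An inverted index over n-grams, of the kind described above, might be sketched as follows; the tokenization and storage choices are illustrative:

```python
from collections import defaultdict

index = defaultdict(set)  # token -> identifiers of data sets containing it

def index_data_set(data_set_id, tokens):
    # Performed before queries arrive, e.g., periodically or when a data
    # set or a new version of one is ingested.
    for token in tokens:
        index[token].add(data_set_id)

def lookup(query_tokens):
    # Union of postings: any data set containing at least one query token.
    result = set()
    for token in query_tokens:
        result |= index.get(token, set())
    return result
```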
[0109] Some embodiments include determining amounts of data absent from the respective data sets in the group as indicated by block 516. Reference to the group in this step should not be read to imply that this step must be performed after the group is determined in step 514. In some embodiments, amounts of data absent from some or all of the data sets in the data exchange may be determined periodically or in response to new data sets or versions thereof, before the query is received in step 512.
[0110] Absent data may be detected with a variety of techniques. Some embodiments may, for each of a plurality of fields of a data set, count the total number of values and the total number of values determined to indicate absent data, and calculate a ratio. Some embodiments may determine a value is absent in response to determining that the value is null, which may be expressly indicated with a null value or a blank, or the value zero in some cases (all of which are consistent with reference to detecting a “null” value herein). In some cases, different tests may be used based on field names, e.g., names indicating dates may be handled differently than those indicating sensor readings.
[0111] Some embodiments may infer that data is absent based upon a data value having the same value as a measure of central tendency of that field in a table, such as the mean, median, or mode for a given field in a given table in a given data set. This may indicate that placeholder values were used when the data set was created or edited.
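The heuristics of paragraphs [0110] and [0111] might be combined as follows; this sketch uses pandas, and the specific rules (treating zeros as null, flagging mode-equal values as placeholders) are illustrative and would be tuned per field type:

```python
import pandas as pd

def absent_ratio(table: pd.DataFrame) -> dict:
    """Per-field ratio of values deemed absent (illustrative heuristics)."""
    ratios = {}
    for field in table.columns:
        col = table[field]
        absent = col.isna()                  # explicit nulls and blanks
        if pd.api.types.is_numeric_dtype(col):
            absent |= (col == 0)             # zeros treated as null here
            mode = col.mode()
            if not mode.empty:
                # Values equal to the mode may be placeholders inserted
                # when the data set was created or edited.
                absent |= (col == mode.iloc[0])
        ratios[field] = float(absent.mean()) if len(col) else 0.0
    return ratios
```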
[0112] Some embodiments may attempt to infer fraudulent data with machine learning techniques or other approaches, for example, scoring data sets based upon: whether they comply with Benford's law, whether they deviate from normal distributions or other statistical distributions, whether there are gaps in values that regularly increment or decrement, or other similar approaches. In some cases, this potentially fraudulent data may also be flagged as another dimension upon which subsequent rankings may be performed with data more likely to have fraudulent values being downranked. Similarly, more absent values, either in total or as a percentage of total data, may cause downranking. In some cases, a count of each may be determined, or some embodiments may produce a weighted count in which null values are given more weight than values equal to a measure of central tendency.
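A Benford's-law check of the kind mentioned above might be sketched as follows; the deviation measure and any threshold for flagging a field as suspect are illustrative choices:

```python
import math
from collections import Counter

def benford_deviation(values):
    """Mean absolute deviation of leading-digit frequencies from Benford's law.

    Assumes finite, nonzero numeric values; larger scores suggest the field
    may warrant review as potentially fraudulent.
    """
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v and v == v]
    if not digits:
        return 0.0
    counts = Counter(digits)
    n = len(digits)
    deviation = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)   # Benford's expected frequency
        observed = counts.get(d, 0) / n
        deviation += abs(observed - expected)
    return deviation / 9
```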
[0113] Some embodiments may determine, based on the amounts of absent data, a ranking of members of the group of data sets responsive to the query, as indicated by block 518. In some cases, those members deemed more responsive may be ranked higher in the ranking. In some cases, step 514 and step 516 may be performed concurrently. For example, some or all of the data sets in the exchange may be ranked by responsiveness to the query, and the group may be determined by selecting those data sets having higher than a threshold ranking. Reference to “the group” of data sets should not be read to imply, by virtue of using the definite article “the,” that the ranking step must be performed after determining the group of data sets. In some cases, the ranking is based on both relevance as determined by the presence of query terms or proximate embedding vectors, and the amounts of absent data, for instance, with weightings applied to each factor. Some embodiments may also account for potentially fraudulent data in the ranking, for instance, with a weighted adjusted relevance score that accounts for the absent data, the fraudulent data, and the initial relevance assessment.
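One form such a weighted adjusted relevance score might take is sketched below; the weights are purely illustrative:

```python
def adjusted_relevance(relevance, absent_ratio, fraud_score,
                       w_rel=1.0, w_absent=0.5, w_fraud=0.8):
    # Relevance raises the score; absent or potentially fraudulent
    # data downranks the data set.
    return w_rel * relevance - w_absent * absent_ratio - w_fraud * fraud_score

def rank(candidates):
    # `candidates` maps data set ids to (relevance, absent_ratio, fraud_score).
    return sorted(candidates,
                  key=lambda ds: adjusted_relevance(*candidates[ds]),
                  reverse=True)
```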
[0114] In some cases, step 514 may produce an initial ranking, and then that ranking may be re-ranked in step 518 when accounting for the absent data. Thus, the ranking in step 518 may be based on both ranking steps. Some embodiments may perform an additional or first thresholding step with the ranking of operation 518 to select a subset of members of the group for further processing such as those having higher than a threshold rank after the ranking of
step 518. In some cases, the number of resulting data sets and their rankings may be between 1 and 100, for instance between 5 and 50, like around 10 or 20.
[0115] In some cases, various third-party applications may connect to the data exchange, to layer services like dashboards, visualizations, notifications, and the like on top of data in the data exchange. These connections may, for example, specify individual data sets or queries thereof and callback functions or APIs to be accessed when new data is available to update the application, which may include event handlers responsive to these updates. In some cases, rankings may be based upon the number of these third-party applications or a popularity-weighted count of these third-party applications as well.
[0116] Some embodiments of the process 500 may include sending summaries of at least some members of the group of data sets with instructions to present the summaries according to the ranking, as indicated by block 520. In some cases, the summaries may be sent to a web-based user interface with instructions that cause a web browser to present a new or otherwise updated user interface in which the data sets and their relative rank are characterized visually. In some cases, the instructions (which can be executable code or data that causes local code to act) direct the web browser or a native application to present an area-based graphical representation of the responsive data sets, such as a tree map. In some cases, the amount of area attributable to each data set may be based on the size of the data set, the ranking of the data set, a value indicative of the reliability of the data set, or the like.
[0117] The ranking may be characterized as a single-level hierarchical data structure. Other examples that may be produced by the search include a multi-level taxonomy in which data sets are grouped by genus, species, and sub-species (e.g., with Latent Dirichlet Allocation (LDA)), where the most relevant examples of a genus, or species, responsive to the query are identified. In some embodiments, hierarchical data structures can be visually represented using a tree map or other area-based visualization. The tree map divides a given display area into regions, where each region may correspond to a data value or set of data values. The size and/or shape of each region may be proportionate to the value it represents, like relevance score for the query, frequency of access, data set size, amounts of absent data, ratings by users, etc.
[0118] In some embodiments, the display area may be divided into rectangles. For each node of the hierarchical data, if the node is a leaf, a rectangle corresponding to the node's value may be drawn. If the node is not a leaf, the encompassing rectangle may be subdivided based on the
values of the node's children. Each child node may then be processed similarly, wherein its corresponding sub-rectangle is either drawn if the child node is a leaf or further subdivided if it is not. In some cases, only a single layer of the hierarchy is shown, without rectangles inside rectangles. In other cases, internal regions may represent parts of the data set or groups of data sets, like individual tables, fields, etc.
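The recursive subdivision described in this paragraph corresponds to a slice-and-dice tree map layout, sketched below; node values are assumed to be positive, with each internal node's value equal to the sum of its children's:

```python
def layout(node, x, y, w, h, horizontal=True, regions=None):
    """Recursively subdivide a rectangle among a node's children.

    `node` is {'name': ..., 'value': ..., 'children': [...]}; leaves omit
    'children'. Returns a list of (name, x, y, w, h) rectangles.
    """
    if regions is None:
        regions = []
    children = node.get("children")
    if not children:
        regions.append((node["name"], x, y, w, h))
        return regions
    total = sum(child["value"] for child in children)
    offset = 0.0
    for child in children:
        share = child["value"] / total
        if horizontal:  # split the encompassing rectangle along x
            layout(child, x + offset * w, y, w * share, h, False, regions)
        else:           # alternate levels split along y
            layout(child, x, y + offset * h, w, h * share, True, regions)
        offset += share
    return regions
```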
[0119] In other embodiments, the regions might be convex polygons. For a given node, if it represents a leaf in the hierarchy, a convex polygon representing the node's value may be drawn. Alternatively, if the node is not a leaf, the polygonal space may be divided into convex sub-polygons that are based on the values of the node's children. Each child node might then be similarly processed within its corresponding sub-polygon. In some embodiments, the convex polygons might be orthoconvex, meaning the polygons may exhibit right angles. The space might be divided such that orthoconvex polygons represent the hierarchical data values.
[0120] Some embodiments may employ Voronoi diagrams to visually represent the hierarchical data. A Voronoi diagram may be generated based on the node values within a given space. For each cell or partition of the Voronoi diagram, if the corresponding node is a leaf in the hierarchy, the cell might be filled or colored accordingly. Otherwise, the cell or partition may be subjected to further Voronoi diagram generation based on the children of the node. Another method of representation, in some embodiments, may involve piecing regions together in a manner reminiscent of a jigsaw puzzle. The given space may be divided such that the regions fit together in a contiguous manner without overlapping and while filling the entire space.
[0121] In further embodiments, the method might incorporate Gosper curves, which are continuous fractal space-filling curves, to represent the hierarchical data. For a given node, if it is a leaf in the data hierarchy, the corresponding space might be filled using a Gosper curve. If the node is not a leaf, the space may be divided based on Gosper curves that correspond to the values of the node's children.
[0122] It should be noted that before applying any of these methods to visualize a tree map, the hierarchical data may be aggregated to a desired level of granularity or detail. The choice of granularity might be influenced by factors such as the resolution of the display, the complexity of the data, or user preferences. Additionally, in some embodiments, different dimensions of the data may be represented using varying visual properties. For instance, the
size of regions or areas might represent one data dimension, while the color of those regions might represent another. Shades, gradients, patterns, or textures may also be utilized to convey additional information or dimensions. To further enhance user experience, some embodiments may introduce interactive features. These features might allow users to zoom into specific sections of the tree map, pan across the tree map, or hover over regions to access more detailed data. Such interactivity might assist users in exploring and understanding the underlying data in more depth.
[0123] It is also conceivable that some embodiments might prioritize stability in the visual representation. When underlying data changes, the tree map might adapt in a manner that minimizes drastic shifts or reconfigurations in the visual layout. This can be beneficial for allowing users to track changes or updates in the data over time.
[0124] In some cases, the area-based visualization may support panning and zooming operations by which larger views of the sets of responsive data may be obtained, or more detailed, granular views may be accessed. In some cases, the user interface may be responsive to a user selection of a given area corresponding to a given data set to produce a sidebar or overlay in which additional information about the respective data set is revealed. In some cases, the summaries present an amount of data from the respective data set determined to be absent or fraudulent, a measure of a trend and an amount of interaction with the respective data set to indicate which data sets are trending, a measure of how often the query terms or synonyms thereof occur in the respective data set being summarized (e.g., a TF-IDF or BM25 score), or the like. In some cases, a user may touch, click, or otherwise select a given data set in the area-based visualization to request and obtain a data-set-specific user interface in which additional information is shown, in some cases including the actual data set, a description of its schema (which may vary among the data sets), or various forms of social interaction and forms of social proof described below with reference to Figure 3.
[0125] Figure 3 illustrates an example of a process 600 that may be performed by the server system 300 discussed above to provide a social graph overlay on information in a data exchange like those discussed above. In some embodiments, the process 600 or different subsets thereof may be performed multiple times, for example, concurrently in different user sessions as users interrogate data, create new versions of data sets, search for data sets, and collaborate on data sets in the data exchange.
[0126] In some embodiments, the process 600 may include obtaining a plurality of data sets, as indicated by block 602. This may occur upon users uploading their data sets to the data exchange for sharing with others, and those uploads may be associated in memory with user profiles of the respective uploading user. In some embodiments, the data sets are in a variety of different formats. In some cases, the number of data sets may be quite large, for example, more than several hundred or several thousand.
[0127] Some embodiments include receiving from a user a selection of a first data set among those that were uploaded or otherwise obtained, as indicated by block 604. In some cases, the selection may be made on a web browser or native application among a set of data sets presented to the user responsive to a query, in a user dashboard, or otherwise identified to the user, for example, using techniques like those described with reference to Figure 2.
[0128] Some embodiments may include receiving a message from the first user to be shared with a second user in association with the first data set regarding collaboration between the first user and the second user on the first data set, as indicated by block 606. In some cases, the first user may select an entire data set to be associated with the message, a table in a data set, a field in a data set, a record in a data set, or an individual value in the data set to be associated with the message, each of which is an example of associating the message with the first data set. In some cases, the message may be a comment on one of these pieces of information, a request for an explanation or a change in one of these pieces of information, a problem noticed regarding one of these pieces of information, or an explanation of one of these pieces of information, like a data analysis. The message may be in the form of natural-language text, or in some cases, the message may include other formats like audio or video or visualizations, such as graphical analyses of data in the first data set. The message need not explicitly mention collaboration to be regarding collaboration. It is enough that the first user supplies a message that is potentially viewable by the second user. The message need not be exclusively addressed to the second user to be a message to the second user, as long as it is viewable by the second user. The second user need not be explicitly identified when the message is received. It is enough that the second user is or will become a member of a population of users, such as a subpopulation of users, that have access to the message.
[0129] Some embodiments include pre-computing summary statistics or visualizations for each of a plurality of fields in the first data set, as indicated by block 608. This step, like the other steps of the methods herein, may be performed in a different order from the order in
which it is presented. For example, pre-computing may be performed upon obtaining the first data set, before receiving the selection or message discussed above. In some embodiments, the pre-computing may be performed with the above-described analysis tools integration module 322.
[0130] Examples of pre-computing include computing population statistics, like measures of central tendency, such as mean, median, or mode, and measures of variation in data values in a field such as standard deviation, variance, kurtosis, ranges between minimum and maximum values, cardinality, and the like. In some cases, pre-computing may include computing multivariate statistics such as correlations or preparing multivariate visualizations such as XY charts in Cartesian coordinates to show how two variables correlate. Some embodiments may analyze each pairwise combination of fields to identify those with particularly strong correlations or otherwise interesting correlations to select a subset of those pairwise (or three-way or higher-order) combinations to pre-compute multivariate statistics or multivariate visualizations in two, three, or more dimensions.
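Pre-computation of this kind might be sketched with pandas as follows; the particular statistics retained, and the use of pairwise correlations to shortlist field combinations for multivariate visualizations, are illustrative:

```python
import pandas as pd

def precompute_field_stats(table: pd.DataFrame) -> dict:
    """Population statistics per numeric field, computed on ingest."""
    stats = {}
    for field in table.select_dtypes(include="number").columns:
        col = table[field].dropna()
        stats[field] = {
            "mean": col.mean(), "median": col.median(),
            "std": col.std(), "variance": col.var(),
            "kurtosis": col.kurtosis(),
            "range": col.max() - col.min(),
            "cardinality": col.nunique(),
        }
    # Pairwise correlations, from which particularly strong pairs may be
    # selected for pre-computed multivariate visualizations.
    stats["_correlations"] = table.corr(numeric_only=True).to_dict()
    return stats
```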
[0131] Some embodiments include causing the message and the summary statistics or visualizations to be presented to the second user, as indicated by block 610. In some cases, this information may be presented on the second user's web browser after the second user navigated to the first data set, for instance, by searching for it or having it suggested to the second user. In some cases, messages may be specifically addressed as direct messages to the second user, in which case the second user may receive notifications in a message inbox with links to the corresponding associated data sets. Causing the message and the summary to be presented may include sending instructions by which the information is presented in a web browser or a native application of a client device of the second user. In some cases, those instructions may take the form of data, such as data of the first data set sent to the user, with other instructions needed to present the data already being resident on the second user's client computing device. In some cases, the data may be presented visually, e.g., on a screen of the second user's client computing device.
[0132] Some embodiments include obtaining another version of the first data set that has undergone a revision, and logging transformations to the first data set and comments on the transformations, as indicated by block 612. In some cases, the second user may make those changes, or a plurality of users may make those changes. In some cases, the logged transformations or comments may take the form of one of the above-described messages, or
these may be maintained in a separate data structure. In some cases, versions may be maintained in a version tree, in which versions may be branched or merged by users. For example, in a structure like that used in a Git repository, in some cases with cryptographic hash digests of each version associated with each respective version. In some cases, the logged transformations or the new versions may be encoded as a set of differences relative to the prior version. In some cases, the new versions may be encoded as a list of transactions by which the prior version was transformed into the subsequent version. In some cases, comments may be associated with a user identifier of the user making the transformations. In some cases, multiple users may concurrently participate in the transformations, each supplying different comments.
[0133] Some embodiments may include causing a listing of versions of the first data set and at least some of the log changes to be presented to the second user, as indicated by block 614. For example, the second user may be presented with a version history, like a version tree, that the second user can interrogate to selectively reveal logged transformations and comments associated with those respective versions. Presenting here, like elsewhere, may take a form like that described above, for example, on a client computing device of the second user, in a native application or a web browser. In some cases, the second user may be invited in those user interfaces to make comments on the transformations or otherwise send messages to those making the transformations or commenting on the version. In some cases, the second user's user interface may also allow the second user to spawn another version in which the second user may make transformations that are logged and supply comments on those transformations. In some cases, versions are immutable, and changes are made by creating new versions.
[0134] Some embodiments may include hosting a conversation among a plurality of users about the first data set, as indicated by block 616. In some cases, this conversation may include messages like those described above, pertaining to different versions of the first data set. In some cases, the conversation may be a separate class of such messages, for example, maintained in a feed, presented in ranked order by age, or by salience to the user viewing the feed. In some cases, that feed may support threads by which new topics may be spawned and messages pertinent to those topics may be exchanged by different users interacting with the above described server system on their respective client computing devices.
[0135] Some embodiments may include providing to the second user a source of the first data set and a set of downstream computer-services drawing from the first data set to perform their services, as indicated by block 618. In some cases, this information may be relevant to the
second user in evaluating the importance or quality of the first data set and its influence on the wider world. This information may also be helpful for ascertaining whether to comment on or rely on the first data set. In some embodiments, those downstream computer services may provide user interfaces that link back to a user interface hosted by the above-described server system, by which the first data set can be accessed as a form of transparency into those respective services. Examples of downstream computer services include native applications presenting information drawn from the first data set, websites presenting information drawn therefrom, automated actions taken by industrial process controls or Internet-of-things appliances based on the values in the first data set, and the like.
[0136] Some embodiments of the process 600 may include receiving from a third user a quality score of a second data set among the plurality of data sets, as indicated by block 620. In some cases, the quality score may take the form of a binary value of a thumbs up or thumbs down; a number of stars in a five-star scale; a rating from 0 to 100; a ranking of the second data set among the plurality of data sets or a subset thereof; or the like. In some cases, the quality score may be an aggregate score based upon similar types of scores pertaining to a variety of topics, such as data completeness, accuracy, reliability, plausibility, or the like, like a linear weighted combination of such sub-scores. In some cases, quality scores may be algorithmically generated using techniques like those described above as well. These quality scores may be stored in memory and presented to other users when accessing the second data set, and used when ranking search results to select which data sets to present, favoring those with higher quality scores in some embodiments.
[0137] Some embodiments may include selecting or up-ranking the second data set in response to a query from a fourth user based on the quality score, as indicated by block 622. In some cases, the fourth user may submit a query (e.g., a natural language query or a Boolean query) and a plurality of data sets may be responsive to that query, such as a plurality of data sets including the second data set. Some embodiments may rank those candidates in whole or in part based upon their quality scores, for example, based on a weighted combination of query relevance and quality score to select which data sets to present responsive to the query (e.g., selecting those above a threshold score or rank) or to rank the order in which those responsive data sets are presented.
[0138] Some embodiments may include determining a measure of credibility of the first data set based on a plurality of quality scores, as indicated by block 624. In some cases, the measure
of credibility may also account for algorithmically generated quality scores or values indicative of absent data using techniques like those described above. In some cases, the measure of credibility may be a measure of central tendency of such measurements or a weighted linear combination of such scores. The resulting measure of credibility may be stored in memory and used when selecting or ranking data sets responsive to queries. In some cases, the measure of credibility may be displayed in association with the first data set when presenting the first data set to users in user interfaces. In some cases, data sets with particularly low measures of credibility may be purged from the system, flagged as lower quality, or otherwise suppressed when selecting data sets to present to users. In some cases, users may inherit the measures of credibility associated with the data sets they upload or data sets they edit. In some cases, measures of credibility may be specific to particular versions of data sets or subsets of data sets like tables or fields or values.
[0139] Some embodiments may include presenting a profile of a user who created the first data set, as indicated by block 626. In some cases, the user's profile may include biographical information, links to the data sets they have provided, links to comments on data sets they have made, links to other users with whom they are associated in a social graph in the system, and the like. In some cases, a social graph may be formed with links between users being created based upon whether those users have accessed the same data set, have commented on the same data set, have messaged each other, or the like. In some cases, these social graphs may be used when selecting content, such as data sets to present to users, for example, upranking or selecting those data sets created or viewed by adjacent users in that social graph.
[0140] Some embodiments of the above-described server system 300 may be configured to accommodate relatively diverse data schemas and formats, with different uploads having different examples of each. In some cases, it may be difficult to predict what data schemas will be present in uploaded data from users and what data formats will be used in any given upload. Many traditional relational database management systems are not well suited for highly variable schemas or formats of ingested data. Often relational databases are arranged in third normal form with predefined tables and relationships therebetween (which is not to suggest that third normal form or relational databases are disclaimed). But newly ingested data of a new schema different from that encountered before may not comport with such an arrangement. And it can be slow and difficult to rewrite the relational database and, in some cases,
impractical. Thus, relational database management systems are often insufficiently flexible for highly diverse data and formats of the same.
[0141] Some embodiments may implement the above-described database 308 by drawing upon features of graph databases and NoSQL (no structured query language) techniques to provide a more flexible database with respect to schema and format of ingested data. Some embodiments may store values of records as nodes in a graph and label those nodes with metadata such as fields of the records. Some embodiments may encode connections or relationships between values as edges in the graph. For example, a record identifier of a given record in a data set may share an edge with each value in that record, and a user identifier in that record may share an edge with a user attribute node. In some cases, edges may be labeled, for instance as predicates in Resource Description Framework (RDF) triples. In some embodiments, fields or other metadata may be encoded as values as well, with relationships indicating that they have that role encoded as edges in the graph.
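By way of a non-limiting illustration of this decomposition, the following sketch (assuming a simple in-memory representation rather than any particular graph database) shows how a record might be flattened into labeled edges in the spirit of RDF triples:

def record_to_triples(record_id, record):
    # Decompose a record into (subject, predicate, object) triples: the
    # record identifier shares a labeled edge with each value in the record.
    return [(record_id, field, value) for field, value in record.items()]

# Example: a record's user identifier becomes an edge to a value node.
print(record_to_triples("rec-42", {"user_id": "u7", "city": "Austin"}))
# [('rec-42', 'user_id', 'u7'), ('rec-42', 'city', 'Austin')]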
[0142] Nodes and edges in a graph data structure need not be explicitly labeled as such to constitute a graph data structure, as long as the characteristics of a graph data structure are present. For example, graph data structures may be encoded as key-value pairs, JSON documents, relational databases, and the like.
[0143] In some embodiments, a method for transforming an arbitrary input dataset with an unknown schema and data format into entries in a graph database is provided and executed by module 314 upon upload. This method may involve dynamically interpreting the data and mapping it to a graph structure. The method, in some embodiments, begins with data inspection and parsing, where the format of the input data is determined, which could be Comma-Separated Values (CSV), JavaScript Object Notation (JSON), Extensible Markup Language (XML), or any other structured or semi-structured format. The data, in some embodiments, is then parsed according to its identified format.
[0144] Following data parsing, in some embodiments, schema inference is performed. In this step, the parsed data is analyzed to infer its schema, which may involve identifying potential nodes and relationships that can be represented in a graph database. This process may utilize machine learning algorithms, such as named entity extraction models and natural-language processing knowledge extraction models, or heuristic methods to deduce the structure and relationships within the data.
[0145] Once the schema is inferred, the method, in some embodiments, involves designing a graph model. This step may include deciding how to represent various entities and their interrelations as nodes and edges in a graph structure. The design of the graph model may vary based on the requirements and nature of the data. Some embodiments may decompose data into key-value pairs, a dictionary, or RDF triples.
[0146] The next step in the method, in some embodiments, is data transformation. In this phase, the original data is converted into the graph model that was designed in the previous step. This conversion process may involve the creation of nodes, edges, and properties in a graph structure that mirrors the relationships and entities identified in the input data.
[0147] Subsequently, the method, in some embodiments, includes loading the transformed data into a graph database. This step may use APIs of graph databases such as Neo4j™ or other similar graph database systems, such as an instance of Postgres™ configured to act as a graph database. The data loading process might involve the creation of database entries that correspond to the nodes and edges of the graph model. In some embodiments, the method may include additional features such as error handling mechanisms to manage inconsistencies or irregularities in the input data. The method might also handle various edge cases, such as nested structures or different datatypes like lists and null values.
[0148] Other possible features that developers might choose to add include the ability to automatically detect the format of the input data, thereby reducing the need for manual specification of the data format. Furthermore, the method could be extended to include various data normalization and cleaning steps prior to schema inference, ensuring that the data is in a suitable format for processing and transformation. This might involve removing duplicates, standardizing formats, or correcting errors. In some embodiments, the method may also support scalability features, allowing it to handle large datasets efficiently. This could involve the use of distributed computing techniques or optimizations specific to graph databases.
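A minimal, hedged sketch of such an ingestion pipeline is shown below; the format-detection heuristics and the node and edge naming scheme are simplifying assumptions for illustration and are not a definitive implementation of module 314:

import csv, io, json

def detect_format(raw):
    # Simplified format detection; production systems may use richer heuristics.
    stripped = raw.lstrip()
    if stripped.startswith(("{", "[")):
        return "json"
    if stripped.startswith("<"):
        return "xml"
    return "csv"

def parse(raw):
    fmt = detect_format(raw)
    if fmt == "json":
        return json.loads(raw)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(raw)))
    raise ValueError("unsupported format: " + fmt)  # e.g., XML omitted here

def to_graph(records):
    # Map each parsed record to a record node with labeled edges to value nodes.
    nodes, edges = set(), []
    for i, rec in enumerate(records):
        rec_node = "record:%d" % i
        nodes.add(rec_node)
        for field, value in rec.items():
            val_node = "value:%s" % value
            nodes.add(val_node)
            edges.append((rec_node, field, val_node))
    return nodes, edges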
[0149] In some embodiments, the database 308 may implement index-free adjacency to facilitate rapid traversal of relationships between data nodes. In this configuration, each data node directly points to its adjacent nodes, thus allowing for swift navigation through these connections without necessitating an index for location purposes. This feature may be particularly beneficial for executing queries that involve extensive relationship traversals or complex join operations. Furthermore, in certain embodiments, the database system may be
configured for handling connected data efficiently. The index-free adjacency model is expected to ensure that the computational cost of traversing relationships does not escalate substantially as the dataset expands. This contrasts with traditional relational database systems where join operations across large tables can be resource-intensive. Some embodiments may benefit from the lower latency offered by the index-free adjacency in the database system. This reduced latency is a result of the elimination of index lookup operations. Additionally, the scalability of the database system may be enhanced in some embodiments due to the index-free nature. As data volume increases, the time required for relationship traversal in a graph database may not increase at the same rate as in an index-based database, given that the graph's structure is appropriately designed. In certain embodiments, the database system may allow for a more intuitive data modeling process. Since relationships are integral to graph databases, they can be represented more naturally compared to relational databases where relationships are often inferred from foreign keys and join tables. Some embodiments of the database system may support a dynamic schema, facilitating easier evolution of the database structure over time. The index-free adjacency model is expected to complement this flexibility by allowing new relationship types to be added without necessitating re-indexing (though embodiments are also consistent with systems that do not have index-free adjacency and that re-index). Consistency in query performance may be another feature in some embodiments. The direct access to adjacent nodes in a graph database can offer more predictable performance for relationship-based queries compared to the variable performance in relational databases, where the structure of data and indices play a significant role. Moreover, in some embodiments, the database system may exhibit reduced storage overhead. The lack of a need for additional indexing structures to manage relationships can lead to a lower storage requirement for relationship data, as opposed to relational databases where indices can occupy substantial storage space.
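The traversal benefit of index-free adjacency may be conveyed with a brief sketch in which an in-memory adjacency map stands in for the database's direct node-to-node pointers; the example graph is hypothetical:

from collections import deque

# Each node stores direct references to its neighbors, so traversal chases
# pointers rather than consulting an index on every hop.
adjacency = {
    "alice": ["dataset1"],
    "dataset1": ["alice", "bob"],
    "bob": ["dataset1", "dataset2"],
    "dataset2": ["bob"],
}

def neighbors_within(start, max_hops):
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in adjacency[node]:  # direct pointer chase, no index lookup
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

print(neighbors_within("alice", 2))  # {'dataset1', 'bob'}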
[0150] In some embodiments, database 308 may employ a horizontally scalable architecture, which may allow for the distribution of data storage and processing across multiple server nodes. This scalability feature may permit the database management system to efficiently manage an increasing volume of data and user requests. Additionally, the system may employ replication and partitioning techniques to enhance data availability and fault tolerance, ensuring uninterrupted operation in the event of individual server failures. Furthermore, some embodiments may incorporate a flexible data model, supporting a variety of formats such as document, key-value, wide-column, and graph. The database 308 may also be schema-less, allowing data to be stored without a predefined schema (or some embodiments may have a
fixed schema). This feature may facilitate quicker adaptation to changing data models and application requirements. The database management system may also be particularly adept at handling large volumes of data at high speeds, making it suitable for applications in big data analytics and real-time data processing. In some embodiments, the database 308 may support agile development practices, enabling rapid prototyping and iteration on applications without the constraints imposed by rigid database schemas. This may lead to faster development cycles and adaptability in application design. Moreover, the database management system may be multi-model in some embodiments, supporting multiple data models such as document and key-value within the same system. This versatility may allow the database to meet a wide range of application needs.
[0151] In some embodiments, a system for dynamic data masking (DDM) is provided, designed to protect sensitive data during database query operations, based on roles and permissions assigned to users and a policy indicating whether the user’s role has permission to access (e.g., read or write) data implicated by a query from that user. This system may integrate with Role-Based Access Control (RBAC), wherein various roles within an organization are assigned distinct levels of permissions for accessing data. These permissions can range from full access to restricted access, which includes masked or obfuscated data visibility.
[0152] When a user submits a query to database 308, in such embodiments, the system initially authenticates the user and determines their assigned role and corresponding permissions. Depending on these permissions, specific data masking rules are applied to the query results. For instance, users with restricted access may receive results where sensitive data is partially or fully masked, whereas users with full access privileges may view the data unmasked. Masking may take a variety of forms, ranging from presenting a null value, to replacing a subset of characters in a value (like the first five characters, or all but the last four) with an X or a pound sign, to computing a cryptographic hash of the value (like an SHA-256 hash), or the like. The data masking rules in some embodiments might include full masking, where all sensitive data is replaced with placeholders, or partial masking, where only a portion of the data is obscured. Alternative approaches like random replacement, where data is substituted with random but plausible values, or data redaction, which involves removing sensitive data altogether, may also be employed. In some embodiments, the masking of data occurs in real time (e.g., within 20 seconds of a query, like within 500 milliseconds) as the database processes the query, and the original data in the database remains unchanged. This real-time operation
ensures that the actual data is not altered, preserving its integrity while still protecting sensitive information.
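A non-limiting sketch of such role-driven masking follows; the role names, policy table, and choice of masking rules are hypothetical placeholders:

import hashlib

# Hypothetical policy mapping roles to masking rules.
POLICY = {"admin": "none", "analyst": "partial", "auditor": "hash", "guest": "full"}

def mask(value, rule):
    if rule == "none":
        return value  # full access: value presented unmasked
    if rule == "partial":
        # Replace all but the last four characters with X.
        return "X" * max(len(value) - 4, 0) + value[-4:]
    if rule == "hash":
        return hashlib.sha256(value.encode()).hexdigest()
    return None  # full masking: present a null value

def apply_masking(row, sensitive_fields, role):
    rule = POLICY.get(role, "full")
    return {f: mask(v, rule) if f in sensitive_fields else v
            for f, v in row.items()}

print(apply_masking({"ssn": "123456789", "city": "Austin"}, {"ssn"}, "analyst"))
# {'ssn': 'XXXXX6789', 'city': 'Austin'}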
[0153] In addition, some embodiments may include additional features for enhanced security and compliance with regulatory standards such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). These features could include audit trails for monitoring and logging access to sensitive data, and configurable masking rules to adapt to changing organizational roles and data access policies. In certain embodiments, the module 334 may also provide a user interface for administrators to easily define and update data masking rules, and for users to understand the extent of their data access permissions. Additionally, the system 300 could integrate with other security measures such as encryption and access logs, providing a comprehensive data security solution. Some embodiments may include variations in the implementation of RBAC, such as context-aware access controls where permissions change based on factors like user geolocation, device being used, or time of access. This flexibility allows the system to adapt to various scenarios and security requirements.
[0154] In some embodiments, a data anonymization method is applied to query responses, which may be based on a principle of k-anonymity. This method may ensure that the data for each individual in a dataset cannot be distinguished from a threshold population size. The implementation of k-anonymity may involve various techniques. One such technique is generalization, where specific data is replaced with more general information. For example, rather than including a full address in a dataset, only the city or state may be mentioned. Another technique is suppression, where certain data points are omitted entirely to protect individual identities. In these embodiments, the masking transformations may modify quasi-identifiers, which are attributes that are not unique identifiers by themselves but can collectively identify an individual when combined. These quasi-identifiers may include, but are not limited to, age, gender, and zip code.
[0155] While implementing k-anonymity, some embodiments may encounter and address potential limitations. For instance, a homogeneity attack, where all records in a group have the same sensitive value, might be mitigated by ensuring diversity in the group's sensitive values. Additionally, to counter background knowledge attacks, where an attacker uses external information to identify individuals, the method may include varying degrees of data generalization and suppression. To enhance the k-anonymity approach, some embodiments
may incorporate l-diversity, which requires that sensitive attributes in each group of k individuals have a variety of distinct values. Another extension, t-closeness, may be employed to ensure that the distribution of a sensitive attribute in any group closely resembles the distribution of that attribute across the entire dataset. In some embodiments, machine learning algorithms could be integrated to optimize the selection of quasi-identifiers and the degree of generalization and suppression. Algorithmic tools may be employed to evaluate the risk of privacy breaches in various k-anonymized datasets. Moreover, user interfaces could be designed to allow users to specify their desired level of privacy and to visualize how different k-anonymization strategies affect both privacy and data utility.
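For concreteness, a minimal sketch of generalization and suppression under k-anonymity follows, assuming zip-code truncation and age bucketing as the generalization functions; these particular choices are illustrative only:

from collections import Counter

def generalize(row):
    # Generalization: coarsen quasi-identifiers (zip prefix, age decade).
    return (row["zip"][:3] + "XX", (row["age"] // 10) * 10, row["gender"])

def k_anonymize(rows, k):
    groups = Counter(generalize(r) for r in rows)
    # Suppression: omit rows whose generalized group is smaller than k.
    return [r for r in rows if groups[generalize(r)] >= k]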
[0156] In some embodiments, system 300 implements differential privacy in query responses, e.g., based on roles and permissions, by applying a randomized algorithm that introduces random noise into the outputs of statistical queries, while maintaining population statistics like mean, median, and mode, or measures of variance. This approach may serve to mask the contribution of any single individual's data within the dataset, thereby maintaining individual privacy. The introduction of noise in such systems can be derived from statistical distributions, which may include but are not limited to, the Laplace distribution or the Gaussian distribution. The degree of noise added can be determined based on two parameters: an epsilon value and the sensitivity of the query, both of which may be specified by a policy associated with roles and permissions. The epsilon value, which may be referred to as the privacy budget, is a parameter that controls the level of privacy afforded by the system. A lower epsilon value may provide stronger privacy guarantees at the potential cost of reduced data utility. The sensitivity of a query, which measures the potential change in the query's output resulting from the alteration of a single individual's data in the dataset, also influences the amount of noise required for a given level of privacy.
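The Laplace mechanism referenced above admits a brief sketch; the epsilon and sensitivity values are placeholders that, in some embodiments, would be supplied by the role-and-permission policy:

import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(true_count, epsilon, sensitivity=1.0):
    # A counting query changes by at most one when a single individual's
    # data is added or removed, so its sensitivity is 1.
    return true_count + laplace_noise(sensitivity / epsilon)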
[0158] In some embodiments, these systems provide a mathematical guarantee that the output of a query is nearly the same whether or not any single individual's data is included in the dataset. This characteristic ensures that the presence or absence of one person's data does not significantly affect the results obtained from the database, thereby protecting individual privacy. The concept of composition in such systems is expected to allow for the management of multiple queries. As more queries are made to the database, the accumulated information may lead to a gradual weakening of the privacy guarantee. This aspect may be managed by
tracking the cumulative privacy budget and blocking queries that exceed that budget or masking more heavily in response to exceeding that threshold. In some embodiments, the approach to differential privacy may be implemented using either a global or local model. In the global differential privacy model, noise is added by the database curator before any data is released, ensuring that the disseminated data adheres to differential privacy standards. Alternatively, in the local differential privacy model, each individual may add noise to their own data before it is sent to the database. While this approach may offer stronger privacy guarantees, it can also result in a higher level of noise in the overall data.
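Tracking a cumulative privacy budget under sequential composition might be sketched as follows; per-user accounting and the blocking behavior are assumptions for illustration:

class PrivacyAccountant:
    # Sequential composition: the epsilons a user spends add across queries.
    def __init__(self, total_budget):
        self.total_budget = total_budget
        self.spent = {}

    def authorize(self, user_id, epsilon):
        spent = self.spent.get(user_id, 0.0)
        if spent + epsilon > self.total_budget:
            return False  # block queries that would exceed the budget
        self.spent[user_id] = spent + epsilon
        return True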
[0159] Some embodiments implement query autocompletion for partially typed user queries in an input text box on a web browser or native application. An example is shown in the user interface 700 of figure 4. As shown, the user interface may be presented in a web browser or native application of a user client computing device. The UI may include a textbox input 702 into which the user has typed a first character, in this case the letter “c.” Some of the embodiments may present in the user interface 700 a list of query completion suggestions conditioned on this first letter (or longer partially entered query). In some embodiments, those suggestions may indicate both the proposed candidate query completion and the source of the proposed completion. For example, candidate 706 has an icon 704 indicating one source for this proposed completion within the database 308, while candidate 710 includes a different icon 708 indicating a different source for this candidate. Examples of different sources having different icons may include previous queries or comments by the user, previous queries or comments by others, dataset titles, table titles, dataset field names, dataset values, and the like. To expedite query completion, some embodiments may, before a query begins to be entered, build and periodically refresh an index of proposed query completions with associated icon identifiers indicating the source of the candidate completions to facilitate relatively quick creation and evolution of the user interface 700, which in some cases may update with each additional character typed to reflect the narrowing of the list of potential candidates.
[0160] In some embodiments, each keystroke of a user is monitored as they type into a search or address bar. This monitoring may occur in real time (e.g., within 500 ms), capturing user input dynamically. Following the detection of user input, partial queries, comprising the characters entered by the user, may be sent to a remote server. This transmission of partial queries could occur after a predetermined number of characters are entered or following a specific delay after typing, to optimize network traffic and system responsiveness. To handle
network latency and enhance user experience, some embodiments may employ techniques such as debouncing, which involves waiting for a pause in typing before sending a request to the server, and caching, where previous autocomplete results are stored for future reuse. Upon receiving the server's response, the browser may display suggested completions in a dropdown menu, ensuring that this process is asynchronous to maintain interface responsiveness.
[0161] On the server-side, partial queries received from the client may be processed, which may involve parsing and sanitizing the queries to prevent security vulnerabilities like SQL injection. In some embodiments, predictive algorithms are employed to predict what the user might be searching for. These algorithms may include, but are not limited to, N-gram models, Markov models, neural network models such as Recurrent Neural Networks, Long Short-Term Memory networks, Transformer-based models, Bayesian models, collaborative filtering, association rule mining, decision trees, random forests, and ranking algorithms.
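Among the predictive approaches listed, an N-gram model is perhaps the simplest to illustrate; the training queries below are hypothetical:

from collections import Counter, defaultdict

class BigramSuggester:
    def __init__(self):
        self.next_words = defaultdict(Counter)

    def train(self, queries):
        for q in queries:
            words = q.lower().split()
            for a, b in zip(words, words[1:]):
                self.next_words[a][b] += 1

    def suggest(self, partial_query, n=3):
        # Most frequent continuations of the final typed word.
        last = partial_query.lower().split()[-1]
        return [w for w, _ in self.next_words[last].most_common(n)]

s = BigramSuggester()
s.train(["climate data 2023", "climate change dataset", "climate data portal"])
print(s.suggest("global climate"))  # ['data', 'change']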
[0162] A database lookup may be performed using the processed query to search a database or multiple databases of possible completions. This database may comprise popular queries, the user's history for personalized suggestions, and real-time data like trending searches or news. The server, in some embodiments, then compiles a list of suggestions based on the query, employing ranking algorithms to determine the most relevant completions. These suggestions, in some embodiments, are sent back to the client in a lightweight format such as JavaScript Object Notation for parsing by the browser. In some embodiments, the server may implement caching mechanisms for frequent queries and their completions to improve response time for common searches.
[0163] Additional features that developers might choose to add to such systems include personalized suggestion mechanisms based on a user's past search history, integration with social media trends for real-time suggestion updates, voice recognition capabilities for hands-free typing, multilingual support for global user accessibility, and machine learning models that adapt to user behavior over time for more accurate predictions. Furthermore, load balancing may be implemented in server infrastructure to distribute the query load across multiple servers, ensuring efficient handling of a high volume of concurrent requests.
[0164] Some embodiments store candidate queries in a data structure known as a trie, or prefix tree, to store and retrieve a set of words or phrases efficiently. The system may comprise a server-side component and a client-side component, each playing a distinct role in facilitating
query completion. In some embodiments, the server includes a trie data structure, which is characterized by nodes and edges where each node represents a character of the alphabet and each path from the root node to a leaf node represents a word or phrase stored in the trie. The trie may be capable of inserting new words or phrases, which involves creating new nodes for each character in the word or phrase that is not already present in the trie, and marking the end of the word or phrase on the final node.
[0165] Additionally, the server 300 may include an API endpoint that accepts query prefixes as input and returns a list of suggestions based on the contents of the trie. This API endpoint may be designed to handle HTTP GET requests, where the query prefix is passed as a query parameter. The server processes this request by searching the trie for words or phrases that start with the given prefix, which may involve traversing the trie from the root node and following the paths that match the characters in the prefix. The server then collects the words or phrases that match the prefix and sends them back as a response to the client.
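A compact sketch of such a trie follows; it omits the weighting and fuzzy-matching extensions discussed below and is illustrative rather than prescriptive:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False  # marks the end of a stored word or phrase

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def complete(self, prefix, limit=10):
        node = self.root
        for ch in prefix:  # follow the path matching the prefix
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect stored phrases beneath the prefix node.
        results, stack = [], [(node, prefix)]
        while stack and len(results) < limit:
            n, text = stack.pop()
            if n.is_end:
                results.append(text)
            for ch, child in n.children.items():
                stack.append((child, text + ch))
        return results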
[0166] On the client side, the system may employ JavaScript™ to capture user input and interact with the server-side API. As a user types into an input field, the JavaScript™ code may send the current input value to the server's API endpoint, using asynchronous web requests, and then display the suggestions returned by the server.
[0167] Furthermore, the system may be extended or modified in various ways. For instance, the trie data structure on the server side may be augmented with additional features such as weight assignments to prioritize certain words or phrases, or the ability to handle fuzzy matching to account for typographical errors in the user input. The server component may also be designed to handle more complex queries or to integrate with databases or external APIs to retrieve the words or phrases for the trie.
[0168] Figure 5 shows an example of a default home page user interface 720 that may be shown on client computing devices 307 (figure 1) as they engage with the server system 300 (figure 1) discussed above in some embodiments. The default home page user interface 720, in this embodiment, includes a text input box 722 that in some embodiments may correspond to the text input box 702 discussed above with reference to figure 4. The default home page user interface 720 may further include cards 724 for recently added data sets, such as data sets that the server system 300 determines are likely to be of interest to the user and that have less than a threshold age.
[0169] Figure 6 shows an example of a dataset detail landing page 730 that may be shown on client computing devices, as they engage with the server system 300 discussed above. In some embodiments, the user interface 730 may show details of a specific dataset, such as one selected through the previous user interface discussed above with reference to figure 5. This user interface 730 may include a list of tables in the dataset 732, each of which may include an entry having details of the table displayed to the user along with statistics about the table like the number of rows and columns and values. The user interface 730 may also include an overview 736 of the dataset, for example drafted by the user who uploaded the dataset, and information about the dataset as a whole in region 738, including update history, contact information for the creator of the dataset, and amounts of missing data.
[0170] Figure 7 shows an example of a dataset detail history page 740 that may be shown on client computing devices as they engage with the server system 300 discussed above. In some embodiments, the user interface includes a data history transaction list 742 with a plurality of transactions 744 in which the dataset at issue was changed. Each transaction 744 may include identifiers of those who participated in the change, a number of changes in the transaction, and a summary of what was changed in the dataset, along with an indication of when the change was made.
[0171] Figure 8 shows an example of a table detail landing page 750 that may be shown on client computing devices responsive to selection of a table. In some embodiments, the user interface 750 includes a description of the table 752, a list of datasets of which it is a part, and the table itself 754, which may include headings 756 identifying field names and values 758 for instances of those fields.
[0172] Figure 9 shows an example of a search results page 760 that may be shown on client computing devices responsive to submission of a query. In some embodiments, the user interface 760 includes a list of search results 762, with descriptions of each of the datasets or tables, statistics about the same, age of the same, identifier of the party who uploaded the same, and number of matches to query terms. In some embodiments, the search results 762 may include both individual tables and datasets responsive to the query, facilitating relatively granular searches down to the table, field, or individual record level in some embodiments. In some cases, the UI indicates whether a responsive result 762 is a data set, a table, a field name, a value, or a comment with an icon like those discussed above with respect to query completion in figure 4.
[0173] Figure 10 is a diagram that illustrates an exemplary computing system 1000 in accordance with embodiments of the present technique. Various portions of systems and methods described herein, may include or be executed on one or more computer systems similar to computing system 1000. Further, processes and modules described herein may be executed by one or more processing systems similar to that of computing system 1000.
[0174] Computing system 1000 may include one or more physical electronic processors (e.g., processors 1010a-1010n) coupled to system memory 1020, an input/output I/O device interface 1030, and a network interface 1040 via an input/output (I/O) interface 1050. A processor may include a single processor or a plurality of processors (e.g., distributed processors). A processor may be any suitable processor capable of executing or otherwise performing instructions. A processor may include a central processing unit (CPU) that carries out program instructions to perform the arithmetical, logical, and input/output operations of computing system 1000. A processor may execute code (e.g., processor firmware, a protocol stack, a database management system, an operating system, or a combination thereof) that creates an execution environment for program instructions. A processor may include a programmable processor. A processor may include general or special purpose microprocessors. A processor may receive instructions and data from a memory (e.g., system memory 1020). Computing system 1000 may be a uniprocessor system including one processor (e.g., processor 1010a), or a multi-processor system including any number of suitable processors (e.g., 1010a-1010n). Multiple processors may be employed to provide for parallel or sequential execution of one or more portions of the techniques described herein. Processes, such as logic flows, described herein may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating corresponding output. Processes described herein may be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Computing system 1000 may include a plurality of computing devices (e.g., distributed computer systems) to implement various processing functions.
[0175] I/O device interface 1030 may provide an interface for connection of one or more I/O devices 1060 to computer system 1000. I/O devices may include devices that receive input (e.g., from a user) or output information (e.g., to a user). I/O devices 1060 may include, for example, graphical user interfaces presented on displays (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor), pointing devices (e.g., a computer mouse or trackball),
keyboards, keypads, touchpads, scanning devices, voice recognition devices, gesture recognition devices, printers, audio speakers, microphones, cameras, or the like. I/O devices 1060 may be connected to computer system 1000 through a wired or wireless connection. I/O devices 1060 may be connected to computer system 1000 from a remote location. I/O devices 1060 located on a remote computer system, for example, may be connected to computer system 1000 via a network and network interface 1040.
[0176] Network interface 1040 may include a network adapter that provides for connection of computer system 1000 to a network. Network interface 1040 may facilitate data exchange between computer system 1000 and other devices connected to the network. Network interface 1040 may support wired or wireless communication. The network may include an electronic communication network, such as the Internet, a local area network (LAN), a wide area network (WAN), a cellular communications network, or the like.
[0177] System memory 1020 may be configured to store program instructions 1100 or data 1110. Program instructions 1100 may be executable by a processor (e.g., one or more of processors 1010a-1010n) to implement one or more embodiments of the present techniques. Instructions 1100 may include modules of computer program instructions for implementing one or more techniques described herein with regard to various processing modules. Program instructions may include a computer program (which in certain forms is known as a program, software, software application, script, or code). A computer program may be written in a programming language, including compiled or interpreted languages, or declarative or procedural languages. A computer program may include a unit suitable for use in a computing environment, including as a stand-alone program, a module, a component, or a subroutine. A computer program may or may not correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program may be deployed to be executed on one or more computer processors located locally at one site or distributed across multiple remote sites and interconnected by a communication network.
[0178] System memory 1020 may include a tangible program carrier having program instructions stored thereon. A tangible program carrier may include a non-transitory computer readable storage medium. A non-transitory computer readable storage medium may include a
machine readable storage device, a machine readable storage substrate, a memory device, or any combination thereof. Non-transitory computer readable storage medium may include nonvolatile memory (e.g., flash memory, ROM, PROM, EPROM, EEPROM memory), volatile memory (e.g., random access memory (RAM), static random access memory (SRAM), synchronous dynamic RAM (SDRAM)), bulk storage memory (e.g., CD-ROM and/or DVD-ROM, hard-drives), or the like. System memory 1020 may include a non-transitory computer readable storage medium that may have program instructions stored thereon that are executable by a computer processor (e.g., one or more of processors 1010a-1010n) to cause the subject matter and the functional operations described herein. A memory (e.g., system memory 1020) may include a single memory device and/or a plurality of memory devices (e.g., distributed memory devices). Instructions or other program code to provide the functionality described herein may be stored on tangible, non-transitory computer readable media. In some cases, the entire set of instructions may be stored concurrently on the media, or in some cases, different parts of the instructions may be stored on the same media at different times.
[0179] I/O interface 1050 may be configured to coordinate I/O traffic between processors 1010a-1010n, system memory 1020, network interface 1040, I/O devices 1060, and/or other peripheral devices. I/O interface 1050 may perform protocol, timing, or other data transformations to convert data signals from one component (e.g., system memory 1020) into a format suitable for use by another component (e.g., processors 1010a-1010n). I/O interface 1050 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard.
[0180] Embodiments of the techniques described herein may be implemented using a single instance of computer system 1000 or multiple computer systems 1000 configured to host different portions or instances of embodiments. Multiple computer systems 1000 may provide for parallel or sequential processing/execution of one or more portions of the techniques described herein.
[0181] Those skilled in the art will appreciate that computer system 1000 is merely illustrative and is not intended to limit the scope of the techniques described herein. Computer system 1000 may include any combination of devices or software that may perform or otherwise provide for the performance of the techniques described herein. For example, computer system 1000 may include or be a combination of a cloud-computing system, a data center, a server
rack, a server, a virtual server, a desktop computer, a laptop computer, a tablet computer, a server device, a client device, a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a vehicle-mounted computer, or a Global Positioning System (GPS), or the like. Computer system 1000 may also be connected to other devices that are not illustrated, or may operate as a stand-alone system. In addition, the functionality provided by the illustrated components may in some embodiments be combined in fewer components or distributed in additional components. Similarly, in some embodiments, the functionality of some of the illustrated components may not be provided or other additional functionality may be available.
[0182] Those skilled in the art will also appreciate that while various items are illustrated as being stored in memory or on storage while being used, these items or portions of them may be transferred between memory and other storage devices for purposes of memory management and data integrity. Alternatively, in other embodiments some or all of the software components may execute in memory on another device and communicate with the illustrated computer system via inter-computer communication. Some or all of the system components or data structures may also be stored (e.g., as instructions or structured data) on a computer-accessible medium or a portable article to be read by an appropriate drive, various examples of which are described above. In some embodiments, instructions stored on a computer-accessible medium separate from computer system 1000 may be transmitted to computer system 1000 via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network or a wireless link. Various embodiments may further include receiving, sending, or storing instructions or data implemented in accordance with the foregoing description upon a computer-accessible medium. Accordingly, the present techniques may be practiced with other computer system configurations.
[0183] In block diagrams, illustrated components are depicted as discrete functional blocks, but embodiments are not limited to systems in which the functionality described herein is organized as illustrated. The functionality provided by each of the components may be provided by software or hardware modules that are differently organized than is presently depicted, for example such software or hardware may be intermingled, conjoined, replicated, broken up, distributed (e.g., within a data center or geographically), or otherwise differently organized. The functionality described herein may be provided by one or more processors of one or more computers executing code stored on a tangible, non-transitory, machine readable medium. In some cases, notwithstanding use of the singular term “medium,” the instructions may be distributed on different storage devices associated with different computing devices, for instance, with each computing device having a different subset of the instructions, an implementation consistent with usage of the singular term “medium” herein. In some cases, third party content delivery networks may host some or all of the information conveyed over networks, in which case, to the extent information (e.g., content) is said to be supplied or otherwise provided, the information may be provided by sending instructions to retrieve that information from a content delivery network.
[0184] The reader should appreciate that the present application describes several independently useful techniques. Rather than separating those techniques into multiple isolated patent applications, applicants have grouped these techniques into a single document because their related subject matter lends itself to economies in the application process. But the distinct advantages and aspects of such techniques should not be conflated. In some cases, embodiments address all of the deficiencies noted herein, but it should be understood that the techniques are independently useful, and some embodiments address only a subset of such problems or offer other, unmentioned benefits that will be apparent to those of skill in the art reviewing the present disclosure. Due to cost constraints, some techniques disclosed herein may not be presently claimed and may be claimed in later filings, such as continuation applications or by amending the present claims. Similarly, due to space constraints, neither the Abstract nor the Summary of the Invention sections of the present document should be taken as containing a comprehensive listing of all such techniques or all aspects of such techniques.
[0185] It should be understood that the description and the drawings are not intended to limit the present techniques to the particular form disclosed, but to the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present techniques as defined by the appended claims. Further modifications and alternative embodiments of various aspects of the techniques will be apparent to those skilled in the art in view of this description. Accordingly, this description and the drawings are to be construed as illustrative only and are for the purpose of teaching those skilled in the art the general manner of carrying out the present techniques. It is to be understood that the forms of the present techniques shown and described herein are to be taken as examples of embodiments. Elements and materials may be substituted for those illustrated and described herein, parts and processes may be reversed or omitted, and certain features of the present techniques may be utilized
independently, all as would be apparent to one skilled in the art after having the benefit of this description of the present techniques. Changes may be made in the elements described herein without departing from the spirit and scope of the present techniques as described in the following claims. Headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description.
[0186] As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). The words “include”, “including”, and “includes” and the like mean including, but not limited to. As used throughout this application, the singular forms “a,” “an,” and “the” include plural referents unless the content explicitly indicates otherwise. Thus, for example, reference to “an element” or “a element” includes a combination of two or more elements, notwithstanding use of other terms and phrases for one or more elements, such as “one or more.” The term “or” is, unless indicated otherwise, non-exclusive, i.e., encompassing both “and” and “or.” Terms describing conditional relationships, e.g., “in response to X, Y,” “upon X, Y,” “if X, Y,” “when X, Y,” and the like, encompass causal relationships in which the antecedent is a necessary causal condition, the antecedent is a sufficient causal condition, or the antecedent is a contributory causal condition of the consequent, e.g., “state X occurs upon condition Y obtaining” is generic to “X occurs solely upon Y” and “X occurs upon Y and Z.” Such conditional relationships are not limited to consequences that instantly follow the antecedent obtaining, as some consequences may be delayed, and in conditional statements, antecedents are connected to their consequents, e.g., the antecedent is relevant to the likelihood of the consequent occurring. Statements in which a plurality of attributes or functions are mapped to a plurality of objects (e.g., one or more processors performing steps A, B, C, and D) encompasses both all such attributes or functions being mapped to all such objects and subsets of the attributes or functions being mapped to subsets of the attributes or functions (e.g., both all processors each performing steps A-D, and a case in which processor 1 performs step A, processor 2 performs step B and part of step C, and processor 3 performs part of step C and step D), unless otherwise indicated. Similarly, reference to “a computer system” performing step A and “the computer system” performing step B can include the same computing device within the computer system performing both steps or different computing devices within the computer system performing steps A and B. Further, unless otherwise indicated, statements that one value or action is “based on” another condition or value encompass both instances in which the condition or value is the sole factor and instances in which the condition or value is
one factor among a plurality of factors. Unless otherwise indicated, statements that “each” instance of some collection have some property should not be read to exclude cases where some otherwise identical or similar members of a larger collection do not have the property, i.e., each does not necessarily mean each and every. Limitations as to sequence of recited steps should not be read into the claims unless explicitly specified, e.g., with explicit language like “after performing X, performing Y,” in contrast to statements that might be improperly argued to imply sequence limitations, like “performing X on items, performing Y on the X’ed items,” used for purposes of making claims more readable rather than specifying sequence. Statements referring to “at least Z of A, B, and C,” and the like (e.g., “at least Z of A, B, or C”), refer to at least Z of the listed categories (A, B, and C) and do not require at least Z units in each category. Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining” or the like refer to actions or processes of a specific apparatus, such as a special purpose computer or a similar special purpose electronic processing/computing device. Features described with reference to geometric constructs, like “parallel,” “perpendicular/orthogonal,” “square”, “cylindrical,” and the like, should be construed as encompassing items that substantially embody the properties of the geometric construct, e.g., reference to “parallel” surfaces encompasses substantially parallel surfaces. The permitted range of deviation from Platonic ideals of these geometric constructs is to be determined with reference to ranges in the specification, and where such ranges are not stated, with reference to industry norms in the field of use, and where such ranges are not defined, with reference to industry norms in the field of manufacturing of the designated feature, and where such ranges are not defined, features substantially embodying a geometric construct should be construed to include those features within 15% of the defining attributes of that geometric construct. The terms “first”, “second”, “third,” “given” and so on, if used in the claims, are used to distinguish or otherwise identify, and not to show a sequential or numerical limitation. As is the case in ordinary usage in the field, data structures and formats described with reference to uses salient to a human need not be presented in a human-intelligible format to constitute the described data structure or format, e.g., text need not be rendered or even encoded in Unicode or ASCII to constitute text; images, maps, and data-visualizations need not be displayed or decoded to constitute images, maps, and data-visualizations, respectively; speech, music, and other audio need not be emitted through a speaker or decoded to constitute speech, music, or other audio, respectively. Computer implemented instructions, commands, and the like are not limited to executable code and can be implemented in the form of data that
causes functionality to be invoked, e.g., in the form of arguments of a function or API call. To the extent bespoke noun phrases (and other coined terms) are used in the claims and lack a self-evident construction, the definition of such phrases may be recited in the claim itself, in which case, the use of such bespoke noun phrases should not be taken as invitation to impart additional limitations by looking to the specification or extrinsic evidence.
[0187] In this patent, to the extent any U.S. patents, U.S. patent applications, or other materials (e.g., articles) have been incorporated by reference, the text of such materials is only incorporated by reference to the extent that no conflict exists between such material and the statements and drawings set forth herein. In the event of such conflict, the text of the present document governs, and terms in this document should not be given a narrower reading in virtue of the way in which those terms are used in other materials incorporated by reference.
[0188] Throughout this specification, including the claims that follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
[0189] The present techniques will be better understood with reference to the following enumerated embodiments:
1. A computer-implemented method, comprising: obtaining, with a computer system hosting a data exchange, a plurality of data sets from a plurality of different users of the data exchange; receiving, with the computer system, a query to search among the plurality of data sets, the query specifying a token; determining, with the computer system, a group of data sets among the plurality of data sets that is responsive to the query, wherein determining the group of data sets comprises checking titles, tables, field names, and cell values of data sets among the plurality of data sets for use of the token; determining, with the computer system, a ranking of members of the group of data sets responsive to the query; and sending, with the computer system, in response to the query, summaries of at least some members of the group of data sets with instructions to present the summaries according to the ranking.
2. The method of embodiment 1, wherein: the plurality of data sets are tabular data, at least some of the plurality of data sets having a plurality of tables in the respective data set, and at least some of the data sets among the plurality being larger than a terabyte; the query is a natural language query received from a remote client computing device of a user, the query including the token; determining the group of data sets comprises determining a synonym of
the token; determining the group of data sets comprises checking titles, tables, field names, and values of data sets among the plurality of data sets for use of the token or the synonym of the token; sending summaries comprises sending an identifier of a source of a respective data set being summarized, an amount of data determined to be absent from the respective data set being summarized, a measure of a trend in an amount of interaction with the respective data set being summarized, and a measure of how often the token or the synonym occurs in the respective data set being summarized; and sending summaries comprises sending instructions to present the summaries in a tree map that supports panning and zooming to reveal more detailed information about the at least some members of the group of data sets.
3. The method of embodiment 1, wherein: at least some of the data sets among the plurality of data sets have multiple versions tracked by the data exchange; and sending summaries comprises sending most recent versions of the at least some members of the group and instructions by which a user can navigate to earlier versions.
4. The method of any one of embodiments 1-3, comprising: for a given data set among the at least some members of the group of data sets, before receiving the query, determining a quality score of the given data set; and pre-computing summary statistics or visualizations of the given data set by: precomputing a first histogram of a first field of the given data set; precomputing a first measure of central tendency of the first field of the given data set; precomputing a second histogram of a second field of the given data set; and precomputing a second measure of central tendency of the second field of the given data set.
5. The method of embodiment 4, wherein: the data sets in the plurality are accumulated over time in the exchange, evolve over time, and have different schemas; and precomputing summary statistics or visualizations comprises: determining a data type of a field of the given data set; and selecting a type of data visualization from a plurality of candidate visualizations based on the data type.
6. The method of any one of embodiments 1-5, comprising: before receiving the query, indexing the data sets among the plurality according to a plurality of tokens including the token specified by the query, the index indicating which data sets among the plurality include respective ones of the tokens, wherein determining the group of data sets comprises accessing the index to identify the group of data sets associated with the token by the index.
7. The method of any one of embodiments 1-6, comprising: precomputing, before receiving the query, statistics for each of a plurality of columns or rows of a given table of a given data set among the at least some members of the group; receiving a selection of the given
table after sending the summaries; and causing the statistics to be presented after receiving the selection.
8. The method of any one of embodiments 1-7, comprising: precomputing, before receiving the query, visualizations for each of a plurality of columns or rows of a given table of a given data set among the at least some members of the group; receiving a selection of the given table after sending the summaries; and causing the visualizations to be presented after receiving the selection.
9. The method of any one of embodiments 1-8, wherein the summaries are presented in an area-based visualization.
10. The method of embodiment 9, wherein the area-based visualization is a tree-map in which each area of the tree-map corresponds to a data set in the at least some members of the group, and wherein a size of each area corresponds to an amount of occurrences of terms of the query in the corresponding data set.
11. The method of any one of embodiments 1-10, wherein: when determining the ranking, different weight is given to whether the query corresponds to terms in a table title, a field name in the table, or a value of a field.
12. The method of any one of embodiments 1-11, wherein: determining the group of data sets responsive to the query comprises searching for tables responsive to the query across the plurality of data sets.
13. The method of any one of embodiments 1-12, comprising: generating the query with a recommendation engine configured to query the data exchange for personalized collections of data sets, the personalization being based on user profiles.
14. The method of any one of embodiments 1-12, wherein: a plurality of applications connect to the data exchange; and the ranking is based on the numbers of applications configured to access different members of the group of data sets.
15. The method of any one of embodiments 1-14, comprising: receiving, with the computer system, from a first user, a selection of a first data set among the plurality of data sets, the first data set having a plurality of versions; receiving, with the computer system, a message from the first user to be shared with a second user in association with the first data set regarding collaboration between the first user and the second user on the first data set; pre-computing, with the computer system, summary statistics or visualizations for each of a plurality of fields of the first data set; causing, with the computer system, the message and the summary statistics or visualizations to be presented to the second user in association with the first data set; obtaining, with the computer system, another version of the first data set that has undergone
revision and logging transformations to the first data set and comments on the transformations to form the another version in a log of changes associated with the first data set; and causing, with the computer system, a listing of versions of the first data set and at least some of the logged changes to be presented to the second user.
16. The method of embodiment 15, wherein the log indicates: an identifier of a user who created the first data set; an identifier of each user who modified the first data set in the versions of the first data set; what changes each user who modified the first data set made; and comments by at least some of the users who modified the first data set explaining their changes.
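The log of embodiments 15 and 16 can be pictured as an append-only list of entries; the field names in this sketch are illustrative, not a claimed schema:

```python
# Sketch of one way to structure the change log of embodiments 15-16;
# the field names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ChangeLogEntry:
    user_id: str         # who created or modified the data set
    transformation: str  # what was changed, e.g. "dropped rows with null ids"
    comment: str         # the user's explanation of the change
    version: int         # version of the data set the change produced
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

# A data set's log is a list of such entries: the first records the
# creator, later ones record each modification and its rationale, which
# supports the per-version listing presented to the second user.
```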
17. The method of any one of embodiments 1-16, wherein: determining the group of data sets comprises steps for searching a data marketplace; and determining the ranking comprises steps for ranking search results.
18. The method of any one of embodiments 1-17, wherein: sending summaries comprises steps for displaying search result rankings.
19. The method of any one of embodiments 1-18, comprising: receiving, from a user, a request to access a selected data set among the at least some members of the group of data sets; determining, based on a policy mapping permissions to the user or role of the user, that the user is not permitted to access a first subset of the selected data set and that the user is permitted to access a second subset of the selected data set; and in response to the determinations regarding access, masking the first subset of the selected data and sending the masked first subset of the selected data and the second subset of the selected data to the user.
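A minimal sketch of the masking behavior in embodiment 19, assuming a per-record dictionary representation and a simple field-level policy:

```python
# Sketch of permission-based masking (embodiment 19); the policy shape
# and mask token are illustrative assumptions.
def apply_policy(record: dict, allowed_fields: set, mask: str = "***") -> dict:
    # Fields outside the permitted subset are masked rather than dropped,
    # so the record keeps its shape for the requesting client.
    return {k: (v if k in allowed_fields else mask) for k, v in record.items()}

# Usage: apply_policy({"name": "Ada", "ssn": "123-45-6789"}, {"name"})
# -> {"name": "Ada", "ssn": "***"}
```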
20. A tangible, non-transitory, machine-readable medium storing instructions that when executed by a data processing apparatus cause the data processing apparatus to perform the operations of any one of embodiments 1-19.
21. A system, comprising: one or more processors; and memory storing instructions that when executed by the processors cause the processors to effectuate the operations of any one of embodiments 1-19.
Claims
1. A computer-implemented method, comprising: obtaining, with a computer system hosting a data exchange, a plurality of data sets from a plurality of different users of the data exchange; receiving, with the computer system, a query to search among the plurality of data sets, the query specifying a token; determining, with the computer system, a group of data sets among the plurality of data sets that is responsive to the query, wherein determining the group of data sets comprises checking titles, tables, field names, and cell values of data sets among the plurality of data sets for use of the token; determining, with the computer system, a ranking of members of the group of data sets responsive to the query; and sending, with the computer system, in response to the query, summaries of at least some members of the group of data sets with instructions to present the summaries according to the ranking.
2. The method of claim 1, wherein: the plurality of data sets are tabular data, at least some of the plurality of data sets having a plurality of tables in the respective data set, and at least some of the data sets among the plurality being larger than a terabyte; the query is a natural language query received from a remote client computing device of a user, the query including the token; determining the group of data sets comprises determining a synonym of the token; the ranking is based on amounts of data determined to be absent from the respective data sets in the group; and sending summaries comprises sending an identifier of a source of a respective data set being summarized, an amount of data determined to be absent from the respective data set being summarized, a measure of a trend in an amount of interaction with the respective data set being summarized, and a measure of how often the token or the synonym occurs in the respective data set being summarized.
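Two elements of claim 2, synonym expansion and ranking by absent data, can be sketched as follows; the synonym table and the use of None as the absent-value marker are assumptions for illustration:

```python
# Sketch of two elements of claim 2: expanding a query token with a
# synonym, and ranking data sets so that those with less absent data
# rank higher. The synonym table is an illustrative assumption.
SYNONYMS = {"revenue": ["sales", "turnover"]}

def expand_tokens(token: str) -> list:
    return [token, *SYNONYMS.get(token.lower(), [])]

def missing_fraction(rows: list) -> float:
    # Fraction of absent cells across all rows, a simple proxy for the
    # "amount of data determined to be absent" from a data set.
    cells = [v for row in rows for v in row.values()]
    return sum(v is None for v in cells) / len(cells) if cells else 1.0

def rank_by_completeness(groups: dict) -> list:
    # groups maps data-set id -> list of row dicts; least-missing first.
    return sorted(groups, key=lambda ds: missing_fraction(groups[ds]))
```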
3. The method of claim 1, wherein: at least some of the data sets among the plurality of data sets have multiple versions tracked by the data exchange; and sending summaries comprises sending most recent versions of the at least some members of the group and instructions by which a user can navigate to earlier versions.
4. The method of claim 1, comprising: for a given data set among the at least some members of the group of data sets, before receiving the query, determining a quality score of the given data set; and pre-computing summary statistics or visualizations of the given data set by: precomputing a first histogram of a first field of the given data set; precomputing a first measure of central tendency of the first field of the given data set; precomputing a second histogram of a second field of the given data set; and precomputing a second measure of central tendency of the second field of the given data set.
5. The method of claim 4, wherein: the data sets in the plurality are accumulated over time in the exchange, evolve over time, and have different schemas; and pre-computing summary statistics or visualizations comprises: determining a data type of a field of the given data set; and selecting a type of data visualization from a plurality of candidate visualizations based on the data type.
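The type-driven selection in claim 5 might, for illustration, map a field's data type to a chart kind; the mapping below is an assumed example, not the claimed rule:

```python
# Sketch of dtype-driven visualization selection (claim 5); the mapping
# from data type to chart kind is an illustrative assumption.
import pandas as pd

def select_visualization(series: pd.Series) -> str:
    if pd.api.types.is_numeric_dtype(series):
        return "histogram"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "time-series line chart"
    if series.nunique() <= 20:
        return "bar chart of category counts"
    return "table of top values"
```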
6. The method of claim 1, comprising: before receiving the query, indexing the data sets among the plurality according to a plurality of tokens including the token specified by the query, the index indicating which data sets among the plurality include respective ones of the tokens, wherein determining the group of data sets comprises accessing the index to identify the group of data sets associated with the token by the index.
7. The method of claim 1, comprising: precomputing, before receiving the query, statistics for each of a plurality of columns or rows of a given table of a given data set among the at least some members of the group; receiving a selection of the given table after sending the summaries; and causing the statistics to be presented after receiving the selection.
8. The method of claim 1, comprising: precomputing, before receiving the query, visualizations for each of a plurality of columns, rows, and values of a given table of a given data set among the at least some members of the group; receiving a selection of the given table after sending the summaries; and causing the visualizations to be presented after receiving the selection.
9. The method of claim 1, wherein the summaries are presented in an area-based visualization.
10. The method of claim 9, wherein the area-based visualization is a tree-map in which each area of the tree-map corresponds to a data set in the at least some members of the group, and wherein a size of each area corresponds to an amount of occurrences of terms of the query in the corresponding data set.
11. The method of claim 1, wherein: when determining the ranking, different weight is given to whether the query corresponds to terms in a table title, a field name in the table, or a value of a field.
12. The method of claim 1, wherein: determining the group of data sets responsive to the query comprises searching for tables responsive to the query across the plurality of data sets.
13. The method of claim 1, comprising: generating the query with a recommendation engine configured to query the data exchange for personalized collections of data sets, the personalization being based on user profiles.
14. The method of claim 1, wherein: a plurality of applications connect to the data exchange; and the ranking is based on the numbers of applications configured to access different members of the group of data sets.
15. The method of claim 1, comprising: receiving, with the computer system, from a first user, a selection of a first data set among the plurality of data sets, the first data set having a plurality of versions; receiving, with the computer system, a message from the first user to be shared with a second user in association with the first data set regarding collaboration between the first user and the second user on the first data set; pre-computing, with the computer system, summary statistics or visualizations for each of a plurality of fields of the first data set; causing, with the computer system, the message and the summary statistics or visualizations to be presented to the second user in association with the first data set; obtaining, with the computer system, another version of the first data set that has undergone revision and logging transformations to the first data set and comments on the transformations to form the another version in a log of changes associated with the first data set; and causing, with the computer system, a listing of versions of the first data set and at least some of the logged changes to be presented to the second user.
16. The method of claim 15, wherein the log indicates: an identifier of a user who created the first data set; an identifier of each user who modified the first data set in the versions of the first data set; what changes each user who modified the first data set made; and comments by at least some of the users who modified the first data set explaining their changes.
17. The method of claim 1, wherein: determining the group of data sets comprises steps for searching a data marketplace; and determining the ranking comprises steps for ranking search results.
18. The method of claim 1, wherein: sending summaries comprises steps for displaying search result rankings.
19. The method of claim 1, comprising: receiving, from a user, a request to access a selected data set among the at least some members of the group of data sets; determining, based on a policy mapping permissions to the user or role of the user, that the user is not permitted to access a first subset of the selected data set and that the user is permitted to access a second subset of the selected data set; and in response to the determinations regarding access, masking the first subset of the selected data and sending the masked first subset of the selected data and the second subset of the selected data to the user.
20. A tangible, non-transitory, machine-readable medium storing instructions that, when executed by a computing system, effectuate operations comprising: obtaining, with a computer system hosting a data exchange, a plurality of data sets from a plurality of different users of the data exchange; receiving, with the computer system, a query to search among the plurality of data sets, the query specifying a token; determining, with the computer system, a group of data sets among the plurality of data sets that is responsive to the query, wherein determining the group of data sets comprises checking titles, tables, field names, and cell values of data sets among the plurality of data sets for use of the token; determining, with the computer system, a ranking of members of the group of data sets responsive to the query; and sending, with the computer system, in response to the query, summaries of at least some members of the group of data sets with instructions to present the summaries according to the ranking.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463574675P | 2024-04-04 | 2024-04-04 | |
| US63/574,675 | 2024-04-04 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025210554A1 (en) | 2025-10-09 |
Family
ID=97266628
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2025/053497 (WO2025210554A1, pending) | Data set discovery in data exchanges | 2024-04-04 | 2025-04-03 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025210554A1 (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160179922A1 (en) * | 2014-12-19 | 2016-06-23 | Software Ag Usa, Inc. | Techniques for real-time generation of temporal comparative and superlative analytics in natural language for real-time dynamic data analytics |
| US20200301916A1 (en) * | 2015-04-15 | 2020-09-24 | Arimo, LLC | Query Template Based Architecture For Processing Natural Language Queries For Data Analysis |
| US20170337265A1 (en) * | 2016-05-17 | 2017-11-23 | Google Inc. | Generating a personal database entry for a user based on natural language user interface input of the user and generating output based on the entry in response to further natural language user interface input of the user |
| US20170364539A1 (en) * | 2016-06-19 | 2017-12-21 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
| US20200110803A1 (en) * | 2018-10-08 | 2020-04-09 | Tableau Software, Inc. | Determining Levels of Detail for Data Visualizations Using Natural Language Constructs |
Similar Documents
| Publication | Title |
|---|---|
| US11775547B2 | Computerized tools to discover, form, and analyze dataset interrelations among a system of networked collaborative datasets |
| US11327996B2 | Interactive interfaces to present data arrangement overviews and summarized dataset attributes for collaborative datasets |
| US20220337978A1 | Computerized tool implementation of layered data files to discover, form, or analyze dataset interrelations of networked collaborative datasets |
| US11609680B2 | Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets |
| US20240281419A1 | Data Visibility and Quality Management Platform |
| US11704321B2 | Techniques for relationship discovery between datasets |
| US11163527B2 | Techniques for dataset similarity discovery |
| US11200248B2 | Techniques for facilitating the joining of datasets |
| US10691710B2 | Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets |
| Schintler et al. | Encyclopedia of big data |
| US20250253016A1 | Adaptive clinical trial data analysis using ai-guided visualization selection |
| US20250259144A1 | Platform for integration of machine learning models utilizing marketplaces and crowd and expert judgment and knowledge corpora |
| WO2025210554A1 | Data set discovery in data exchanges |
| WO2025210555A1 | Social-graph overlay in data exchanges |
| Sarwar | Recommending whom to follow on GitHub |
| Cao et al. | E-commerce Big Data Technical System |
| Kemp | Scholarly Usage and Impact Vocabularies Glossary and Crosswalk |
| Li | Smart recommendations in E-commerce using convolutional neural networks and collaborative filtering |
| Al Kadah | A new framework for decentralized social networks: harnessing blockchain, deep learning, and natural language processing |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25781991; Country of ref document: EP; Kind code of ref document: A1 |