WO2024261225A1 - Data item classification - Google Patents
- Publication number
- WO2024261225A1 (PCT/EP2024/067408)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- probability
- data
- data item
- threshold
- invoice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q20/00—Payment architectures, schemes or protocols
- G06Q20/38—Payment protocols; Details thereof
- G06Q20/40—Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
- G06Q20/401—Transaction verification
- G06Q20/4016—Transaction verification involving fraud or risk level assessment in transaction processing
Definitions
- the present invention relates to techniques for classifying whether data items, such as electronic invoices, are associated with a particular type of transaction, specifically recurrent and trusted transactions.
- Substantial data processing and data storage efficiencies can be achieved in accounting software systems by identifying whether data items such as invoices are associated with “recurrently occurring and trusted transactions”. These are frequent and regular transactions between vendors and buyers that have a low probability of being irregular or unauthorised and are therefore “trusted”. Identifying invoices, or similar data items, related to these transactions within accounting software systems can greatly improve the efficiency of invoice processing: processing and payment can be expedited with reduced checking, or with increased or complete automation, bypassing the more complex processing that is applied to other invoices.
- Machine learning solutions using time-series forecasting have also been explored as a means of identifying recurrently occurring and trusted transactions within accounting software systems. While these solutions can be more adaptive, they may yield less reliable predictions when invoice data deviates even slightly from historical data. This sensitivity to small data changes can result in false positives or negatives, leading to inefficiencies in the invoice processing workflow. Additionally, these techniques typically necessitate manual threshold setting to some extent, which can further slow the process and introduce potential errors.
- a computer implemented method for classifying received data items as being associated with recurrently occurring and trusted transactions comprises the steps of: receiving a data item from a source; extracting a plurality of feature values from the data item, each feature value associated with one of a plurality of predefined data item features; retrieving from storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, wherein the probability data set includes a plurality of probability distributions, each probability distribution representing the likelihood that one of the plurality of predefined data item features will take a given value, each probability distribution based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient, and the threshold data set comprises a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions; and for each extracted feature value: determining, using
- the method further comprises the steps of: transmitting the classification data to a user device; and receiving feedback data comprising a validation or a rejection of the classification from a user via the user device, wherein the classification data is updated based on the feedback data if a rejection is received from the user via the user device.
- upon receiving a validation or rejection of the classification in the feedback data from the user via the user device, the method further comprises the steps of: updating the probability thresholds based on the feedback data; and storing the updated probability thresholds in the storage for use in classifying future received data items.
- the plurality of probability distributions comprises a plurality of probability density functions.
- the probability threshold for each probability distribution is determined based on the entropy of the data set of data item feature values to which the probability distribution relates.
- each probability density function is derived by applying a kernel density estimation function to the data set of data item feature values to which the probability density function relates.
- each probability threshold defines at least one region of highest probability density of the respective probability density function
- the step of determining, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted feature value exceeds the probability threshold comprises: determining if the extracted feature value is associated with a probability within a region of highest probability density of the respective probability distribution function defined by the probability threshold.
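The kernel density estimation and highest-density-region test described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the Gaussian kernel, the fixed `bandwidth`, the `coverage` parameter and all function names are assumptions introduced here, and the threshold is represented as a density cutoff chosen so that roughly `coverage` of the historical values fall at or above it.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the probability density at x (Gaussian kernel, assumed)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def hdr_density_cutoff(density, samples, coverage):
    """Density level such that about `coverage` of the samples lie at or above it."""
    levels = sorted((density(s) for s in samples), reverse=True)
    index = max(0, min(len(levels) - 1, math.ceil(coverage * len(levels)) - 1))
    return levels[index]

def in_highest_density_region(density, cutoff, value):
    """True if the value's estimated density reaches the region's cutoff."""
    return density(value) >= cutoff

# Hypothetical historical invoice amounts for one source-recipient pair.
amounts = [100.0, 101.0, 99.5, 100.5, 100.0, 250.0]
pdf = gaussian_kde(amounts, bandwidth=2.0)
cutoff = hdr_density_cutoff(pdf, amounts, coverage=0.8)
```

With these hypothetical values, an amount close to 100 falls inside the region while an amount of 400 does not. In practice the bandwidth and coverage would need to be chosen per feature; the source states only that the threshold is based on the entropy of the feature data set.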
- the method further comprises the steps of, for each extracted feature value: calculating a probability value, using the corresponding probability distribution, indicative of a probability that the associated data item feature will take that value; and generating a score for the received data item when it is classified to be trusted using the calculated probability values.
- the step of generating a score for the received data item comprises summing the calculated probability values multiplied by a scaling factor, wherein the scaling factor is based on a size of the corpus of previously received data items from the source sent to the recipient, wherein larger corpora lead to a higher scaling factor, reflecting increased confidence in the classification.
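The scoring step can be sketched as below. The source says only that the scaling factor grows with the size of the corpus; the saturating form `n / (n + k)` and the constant `half_saturation` are assumptions chosen for illustration, as are the function names.

```python
def scaling_factor(corpus_size, half_saturation=10):
    """Assumed scaling: rises towards 1.0 as the corpus grows."""
    return corpus_size / (corpus_size + half_saturation)

def confidence_score(probabilities, corpus_size):
    """Sum of per-feature probability values, scaled by corpus size."""
    factor = scaling_factor(corpus_size)
    return sum(p * factor for p in probabilities)
```

For example, the same per-feature probabilities yield a higher score when they are derived from 100 previous invoices than from 10, reflecting the increased confidence described above.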
- the data item received from the source is an invoice data item;
- the plurality of predefined data item features are predefined invoice features, and each probability distribution is based on a data set of invoice data item feature values from a corpus of previously received invoice data items from the source sent to the recipient.
- the method comprises generating classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
- the method further comprises initiating a first transaction processing process based on content of the received data item if the classification data is indicative of the received data item being associated with a recurrently occurring and trusted transaction.
- the method further comprises initiating a second transaction processing process based on the content of the received data item if the classification data is indicative of the received data item not being associated with a recurrently occurring and trusted transaction.
- the first transaction processing process is an automated or partially automated transaction process.
- a system for classifying received data items as being associated with recurrently occurring and trusted transactions comprises storage on which is stored a plurality of probability data sets and threshold data sets, each probability data set and threshold data set associated with a source-recipient pair, each probability data set comprising a plurality of probability distributions, each probability distribution representing the likelihood that one of a plurality of predefined data item features will take a given value based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient with which the probability data set is associated, and each threshold data set comprising a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions.
- the system further comprises a data item classification module comprising a data item processing module and a classification decision module, wherein said data item processing module is configured to: receive a data item from a source; extract a plurality of feature values from the data item, each value associated with one of a plurality of predefined data item features; retrieve from the storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, and the classification decision module is configured, for each extracted feature value, to: determine, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted value exceeds the probability threshold, and if, for at least a predetermined proportion of the extracted feature values, the probability that the data item feature takes the extracted value exceeds the probability threshold, the classification decision module is configured to generate classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
- a data item classification module for use in a system according to the second aspect, comprising a data item processing module and a classification decision module, wherein said data item processing module is configured to: receive a data item from a source; extract a plurality of feature values from the data item, each feature value associated with one of a plurality of predefined data item features; retrieve from storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, wherein the probability data set includes a plurality of probability distributions, each probability distribution representing the likelihood that one of the plurality of predefined data item features will take a given value, each probability distribution based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient, and the threshold data set comprises a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions, and
- a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of the first aspect.
- a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of the first aspect.
- a technique of classifying whether a received document is associated with a recurrently occurring and trusted transaction is provided.
- the invention uses probability distributions and probability thresholds to determine whether a data item, such as an electronic invoice, is associated with a recurrently occurring and trusted transaction.
- the method extracts a plurality of feature values from a received data item, each feature value associated with one of a predefined set of data item features. It then retrieves probability data sets and threshold data sets associated with a source-recipient pair, comprising the source and the intended recipient of the document.
- the probability data set includes multiple probability distributions, each representing the likelihood that one of the predefined document features will take a given value. These distributions are based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient.
- the threshold data set comprises a plurality of probability thresholds, each defining a probability relative to one of the probability distributions.
- For each extracted feature value, the method determines if the probability that the data item feature takes the extracted feature value exceeds the probability threshold. If this is true for at least a predetermined proportion of the extracted feature values, the method generates classification data indicative of the received document being associated with a recurrently occurring and trusted transaction.
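The decision rule just described can be sketched as a comparison of per-feature probabilities against per-feature thresholds. The feature names, the dictionary layout and the default `required_proportion` of 1.0 (all thresholds exceeded) are assumptions made for illustration.

```python
def classify_rtt(feature_probabilities, thresholds, required_proportion=1.0):
    """Return True if enough features have a probability above their threshold.

    feature_probabilities: feature name -> probability of the extracted value
    thresholds: feature name -> probability threshold for that feature
    """
    passed = sum(
        1 for name, p in feature_probabilities.items() if p > thresholds[name]
    )
    return passed / len(feature_probabilities) >= required_proportion

# Hypothetical values for a single incoming invoice.
probabilities = {"invoice_amount": 0.42, "days_since_last": 0.35, "temporal": 0.05}
thresholds = {"invoice_amount": 0.20, "days_since_last": 0.20, "temporal": 0.10}
```

With these values two of the three features pass, so the invoice is classified as relating to a recurrently occurring and trusted transaction only if `required_proportion` is set to two thirds or lower.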
- This classification can then be used to initiate a suitable process, for example an automated or semi-automated invoice payment process.
- This invention offers several advantages over previous methods by overcoming the limitations of rule-based solutions and time-series forecasting techniques.
- the technique can be easily expanded to handle a large number of vendors and transactions without the need for manually creating and maintaining rules, making it a more efficient solution for identifying regular and trustworthy transactions. Additionally, it can adapt to changes in invoice patterns more effectively, as it uses probability distributions and thresholds instead of fixed rules, providing a more flexible approach.
- the accuracy of classification is improved due to the use of probability distributions and thresholds in determining whether a document is associated with a recurrently occurring and trusted transaction.
- the technique captures the patterns and variations in the invoices more effectively than rule-based solutions or time-series forecasting.
- This probabilistic approach allows the system to better identify and adapt to invoice patterns and changes, leading to fewer false positives and negatives.
- the classification of data items as being associated with recurrently occurring and trusted transactions is more accurate, contributing to a more efficient and reliable processing workflow.
- the technique automates the classification of data items, reducing the need for manual threshold setting and rule maintenance, ultimately simplifying, for example, an invoice processing workflow.
- Figure 1 provides a simplified schematic diagram depicting a system in which a technique for classifying invoice data items as being associated with recurrently occurring and trusted transactions is implemented, in accordance with certain embodiments of the invention.
- Figure 2 provides a diagram depicting a process undertaken by the system depicted in Figure 1, in accordance with certain embodiments of the invention.
- Figure 3 provides a diagram depicting a process for classifying an invoice data item as being associated with a recurrently occurring and trusted transaction, in accordance with certain embodiments of the invention.
- Figure 4 provides a diagram depicting a process for generating a probability distribution, in accordance with certain embodiments of the invention.
- Figure 5 provides a diagram depicting a process of generating a probability threshold, in accordance with certain embodiments of the invention.
- Figure 6A and 6B depict probability distributions of the occurrence of feature values in invoice data items, in accordance with certain embodiments of the invention.
- Figure 7 provides a diagram depicting a process for generating a confidence score, in accordance with certain embodiments of the invention.
- Figure 8 provides a diagram depicting the variance of a scaling factor used to generate a confidence score, in accordance with certain embodiments of the invention.
- Figure 9 provides a simplified schematic diagram depicting an example system in which techniques in accordance with certain embodiments of the invention can be implemented.
- Figure 10 provides a simplified schematic diagram depicting an implementation of a scheme for automatically initiating a payment associated with a received invoice data item.
- Figure 1 provides a simplified schematic diagram of an invoice classification system 101, arranged in accordance with certain embodiments of the invention.
- the system is arranged to receive an electronic invoice issued by a vendor and sent to a buyer and then classify whether or not the invoice relates to a recurrently occurring and trusted transaction (RTT).
- This classification is based on a probabilistic analysis of values that certain features of the invoice take with respect to invoices that have been previously sent from the vendor to the buyer. If it is determined that the invoice relates to a recurrently occurring and trusted transaction, the invoice is forwarded for payment processing to an automated payment process. On the other hand, if it is determined that the invoice does not relate to a recurrently occurring and trusted transaction, the invoice is forwarded to a payment process that involves a manual review.
- the system includes a historical invoice analysis module 102, an invoice classification module 103, and a payment processing module 104. These components are typically integrated into a larger accounting services system which provides accounting services to a user.
- Figure 1 also illustrates a source computer system 105, which sends an invoice to the invoice classification module 103 through a network 106.
- the electronic invoice can take the form of any suitable electronic data item, for example an email, a PDF document, Microsoft Word or Excel files, image files (such as JPEG or PNG), plain text files, or other electronic file formats such as CSV, XML, or JSON.
- the invoice can be communicated to the invoice classification module 103 in any suitable way depending, for example, on the way in which the accounting services system provides accounting services to the user device 117.
- the invoice can be sent directly from the source computer system 105 to the invoice classification module 103, for example, as an attachment to an email.
- the source computer system 105 may send the invoice as an attachment to an email to the user device 117 which is then forwarded by software running on the user device 117 to the invoice classification module 103.
- the user device 117 may receive the invoice from the source computer system 105 and upload the invoice via a suitable interface for processing to the invoice classification module 103.
- the historical invoice analysis module 102 comprises a historical invoice data database 107, on which is stored invoices previously sent from various vendors to various buyers.
- the historical invoice analysis module 102 further includes an invoice feature analysis function 108, a probability data and threshold data calculation unit 109, and a probability data and threshold data storage unit 110.
- the invoice classification module 103 comprises an incoming invoice processing unit 111 and a classification decision module 112. Additionally, the invoice classification module 103 features a reinforcement learning function 113.
- the payment processing module 104 consists of an automatic approval module 114, a manual approval module 115, and an approved payment database 116 (provided specifically by a general ledger and accounts payable database).
- the historical invoice analysis module 102 is substantially an offline module, configured to generate and store probability data and threshold data ahead of the operation of invoice classification module 103.
- the invoice classification module 103 is an online module, operating substantially in “real-time”.
- the historical invoice analysis module 102 analyses previously exchanged invoices to determine the frequency with which certain invoice features take certain values in the previously issued invoices. This information is then used to calculate a probability distribution for each invoice feature.
- the probability distribution shows the distribution of probabilities that each invoice feature will take a specific value in future invoices exchanged between the particular vendor-buyer pair.
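As a simple illustration of turning value frequencies into a probability distribution, the sketch below uses relative frequencies over a corpus of previously seen values. This is an assumption made for clarity; the function name and sample data are hypothetical, and elsewhere the source describes deriving probability density functions by kernel density estimation instead.

```python
from collections import Counter

def empirical_distribution(values):
    """Map each observed feature value to its relative frequency in the corpus."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical invoice amounts previously sent by one vendor to one buyer.
history = [100, 100, 100, 120]
distribution = empirical_distribution(history)
```

Here an amount of 100 has an estimated probability of 0.75 and an amount of 120 has 0.25; a value never seen before has no entry, which is one reason a smoothed estimate may be preferable for continuous features.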
- a probability threshold is then determined. This threshold is set so that, with reference to the relevant probability distribution, if the probability of an invoice feature value occurring exceeds the threshold probability, this is suggestive of the invoice relating to a recurrently occurring and trusted transaction. If this is the case for multiple feature values, this is sufficient to classify the invoice as relating to a recurrently occurring and trusted transaction.
- An invoice received by the invoice classification module 103 is then analysed to identify the value that each predetermined invoice feature takes in that invoice, and the probability distributions are then used to determine whether or not the probability of these feature values occurring exceeds the relevant probability thresholds. If a probability criterion is met, typically that the probability thresholds are exceeded for all the identified feature values, then the invoice is classed as a recurrently occurring and trusted transaction invoice and forwarded for automatic processing.
- the historical invoice data database 107 has stored thereon data associated with multiple vendor-buyer pairs. This data is arranged into a plurality of different corpuses, each corpus comprising a plurality of invoices associated with a particular vendor-buyer pair.
- the invoice feature analysis function 108 is configured to process each invoice to locate the predetermined invoice features, and then identify the values that these features take. A data set of these feature values is then generated.
- the predefined invoice features are typically selected on the basis that the values that they take are indicative of whether or not a transaction is a recurrently occurring and trusted transaction.
- Invoice features that work optimally in this regard include those that tend to have values that are the same or similar across most or all recurring trusted transactions but vary across the broader corpus of invoices that also contains invoices which do not relate to recurrently occurring and trusted transactions. Examples include “invoice amount”, i.e. the numerical amount of money associated with a transaction, because recurring trusted transactions between a given vendor and buyer typically have the same or similar value; and “number of days since last invoice issued”, because invoices associated with recurring trusted transactions will typically be issued at regularly spaced intervals.
- Another suitable invoice feature is a “temporal” value combination, that is, a combination of temporal-related values, for example the combination of issue date (e.g. 2), issue day of the week (e.g. Monday), issue month (e.g. January) and issue year (e.g. 2023).
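A temporal value combination of this kind can be derived from a single issue date, as in the sketch below. The tuple layout and function name are assumptions; explicit name lists are used so the result does not depend on the system locale.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

def temporal_combination(issue_date):
    """Return (day of month, day of week, month, year) for an invoice issue date."""
    return (
        issue_date.day,
        DAY_NAMES[issue_date.weekday()],    # weekday(): Monday is 0
        MONTH_NAMES[issue_date.month - 1],  # month: January is 1
        issue_date.year,
    )

combination = temporal_combination(date(2023, 1, 2))
```

For an invoice issued on 2 January 2023 this gives `(2, "Monday", "January", 2023)`, matching the example values in the text.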
- the invoice feature analysis function 108 is configured to communicate a data set indicative of the values of the invoice features identified in the corpus of invoice data items to the probability data and threshold data calculation unit 109. For example, with reference to the invoice features described above, this would include a data set comprising all the extracted invoice-amount values; a data set comprising all the extracted number-of-days-since-last-invoice-issued values and all of the temporal-related value combination values.
- the probability data and threshold data calculation unit 109 is configured to then apply a probability distribution algorithm to this data to generate, for each invoice feature, a probability distribution showing the distribution of probabilities that the predefined invoice feature will take a range of different values in invoices exchanged between the vendor-buyer pair.
- this would comprise three probability distributions per vendor-buyer pair, namely: a probability distribution of invoice-amount values; a probability distribution of days-since-last-invoice-issued values and a probability distribution for each identified combination of temporal-related values.
- the probability data and threshold data calculation unit 109 then applies a threshold setting algorithm to each probability distribution to generate a probability threshold for each probability distribution.
- the threshold setting algorithm is configured to determine a probability threshold for each probability distribution. If an invoice feature value appears in an invoice and the probability distribution indicates that this value has a probability of occurring that exceeds the threshold, this is indicative of the invoice in which the value appears relating to a recurrent transaction. As will be described in more detail below, the presence of several such invoice feature values provides the basis on which a received invoice is classified as relating to a recurrently occurring and trusted transaction.
- an invoice data item generated by the source computer system 105 is received via the network 106 at the invoice processing unit 111.
- the invoice processing unit 111 analyses the invoice data item to determine the buyer-vendor pair (source-recipient pair) with which it is associated, specifically to determine the vendor it is from and the buyer for whom it is intended. This can be done in any suitable way. For example, one approach is to use Optical Character Recognition (OCR) to extract text from the invoice data item and employ pattern matching or natural language processing techniques to identify the vendor and buyer information.
- Another approach involves parsing structured data, such as XML or EDI, to extract the relevant fields directly and determine the vendor and buyer. If the invoice was sent as an email or as an attachment in an email, analysing email metadata, including the sender, recipient, or subject line, can be used to deduce the vendor and buyer.
- searching for specific keywords or phrases can help identify the vendor and buyer. Relying on pre-defined templates is another option; if the invoice follows a known template, the vendor and buyer information can be extracted based on the expected layout and data fields.
- the buyer may be an organisation associated with the user device 117 and the vendor (the source) may be an organisation associated with the source computer system 105.
- the invoice processing unit 111 loads the probability data (a probability data set) and the threshold data (a threshold data set) associated with the identified buyer-vendor pair from the probability data and threshold data storage unit 110.
- the probability data may include a probability distribution associated with values the invoice-amount invoice feature could take; a probability distribution associated with values the number-of-days-since-last-invoice invoice feature could take, and a probability distribution associated with values the temporal values combination invoice feature could take.
- a value for the invoice amount for example $100
- a value for the number of days since last invoice for example 28
- the values for features such as the invoice amount and temporal features can be determined using text recognition algorithms if the invoice data item contains text data or Optical Character Recognition (OCR) technology if the invoice data item is an image file. In the latter case, OCR is used to convert the image data into text, which is then analysed to extract the required feature values.
- the values cannot be derived directly from the invoice itself and must be determined with reference to additional data.
- the invoice processing unit 111 accesses an external data source containing the invoice history of the vendor, such as a database or spreadsheet. The system retrieves the date of receipt for the previous invoice from the vendor and compares it with the current invoice's date of receipt. The difference in days between the two invoices is then determined, providing the value for the "days since the last invoice" feature.
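The "days since the last invoice" calculation described above amounts to a date difference. A minimal sketch, assuming both receipt dates are already available as `date` objects (the function name and the dates are hypothetical):

```python
from datetime import date

def days_since_last_invoice(previous_receipt, current_receipt):
    """Number of whole days between the previous and current invoice receipt dates."""
    return (current_receipt - previous_receipt).days

# Hypothetical receipt dates retrieved from the vendor's invoice history.
gap = days_since_last_invoice(date(2023, 1, 2), date(2023, 1, 30))
```

With these dates the feature value is 28 days.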
- the feature values determined at the fourth step S204 and the probability data and threshold data loaded at the third step S203 are passed to the classification decision module 112.
- the classification decision module 112 performs a recurrently occurring and trusted transaction classification algorithm. Specifically, for each feature value identified at the fourth step S204, it is determined if the probability of that feature value occurring exceeds the relevant predetermined probability threshold.
- the RTT classification algorithm determines that the invoice relates to a recurrently occurring and trusted transaction.
- the classification decision module 112 communicates the corresponding RTT classification data to the user device 117 via the network 106. This is data indicating whether or not the classification decision module 112 has classified the received invoice as relating to a recurrently occurring and trusted transaction.
- On receipt of the RTT classification data, the user device 117 is configured to present, via an interface, the classification decision to the user and prompt the user to provide classification feedback data either confirming (validating) or rejecting the RTT classification. This classification feedback data is then returned (transmitted) to the classification decision module 112 via the network 106. If the classification decision module 112 has generated classification data indicating the received invoice data item relates to a recurrently occurring and trusted transaction and the user provides classification feedback data validating that the RTT classification data is correct, at an eighth step S208, the classification decision module 112 communicates this validated RTT classification data to the automatic approval module 114 of the payment processing module 104.
- On receipt of this RTT classification data, which indicates that the received data item relates to a recurrently occurring and trusted transaction, the automatic approval module 114 initiates an automated approval of the payment for the transaction to which the received invoice data item relates. Typically, this includes extracting the relevant payment data from the invoice, forwarding this to the general ledger and accounts payable database 116 and initiating the relevant processes for the vendor to get paid.
- if the classification decision module 112 generates classification data indicating that the received invoice data item does not relate to a recurrently occurring and trusted transaction, and the user provides classification feedback data validating that this RTT classification data is correct, then the classification decision module 112 communicates this validated RTT classification data to the manual approval module 115.
- On receipt of this validated RTT classification data, which indicates that the received data item does not relate to a recurrently occurring and trusted transaction, the manual approval module 115 initiates a process whereby payment of the transaction to which the invoice relates must be manually approved, for example via a user of the user device 117. If this manual approval process results in the invoice being approved, the relevant payment data is communicated to the general ledger and accounts payable database 116.
- the classification decision module 112 is configured to update the classification data based on this feedback data.
- If the classification decision module 112 generates classification data indicating that the received invoice data item relates to a recurrently occurring and trusted transaction, but the user provides classification feedback data rejecting this classification (i.e. indicating that the RTT classification data is incorrect and the received invoice data item does not relate to a recurrently occurring and trusted transaction), the classification decision module 112 updates the classification data to indicate that the received invoice data item does not relate to a recurrently occurring and trusted transaction. At a ninth step S209, the classification decision module 112 then communicates the updated RTT classification data (indicating a non-RTT classification) to the manual approval module 115.
- On receipt of this classification data, the manual approval module 115 then initiates a process whereby payment of the transaction to which the invoice relates must be manually approved, for example via a user of the user device 117. If this manual approval process results in the invoice being approved, the relevant payment data is communicated to the general ledger and accounts payable database 116.
- If the classification decision module 112 generates classification data indicating that the received invoice data item does not relate to a recurrently occurring and trusted transaction, but the user provides classification feedback data rejecting this classification (i.e. indicating that the RTT classification data is incorrect and the received invoice data item does in fact relate to a recurrently occurring and trusted transaction), the classification decision module 112 updates the classification data to indicate that the received invoice data item does relate to a recurrently occurring and trusted transaction.
- the classification decision module 112 then communicates the updated RTT classification data (indicating an RTT classification) to the automatic approval module 114 of the payment processing module 104.
- On receipt of the updated RTT classification data, which indicates that the received data item relates to a recurrently occurring and trusted transaction, the automatic approval module 114 initiates an automated approval of the payment for the transaction to which the received invoice data item relates.
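The four feedback outcomes described above (RTT or non-RTT classification, each either confirmed or rejected by the user) can be summarised in a minimal sketch. The names `Route` and `route_invoice` are hypothetical and introduced only for illustration; the text does not prescribe any particular implementation.

```python
from enum import Enum
from typing import Tuple


class Route(Enum):
    """Destination modules named in the text (enum itself is hypothetical)."""
    AUTOMATIC_APPROVAL = "automatic approval module 114"
    MANUAL_APPROVAL = "manual approval module 115"


def route_invoice(classified_as_rtt: bool, user_confirms: bool) -> Tuple[bool, Route]:
    """Return (final RTT classification, destination module).

    A confirmation keeps the classification; a rejection flips it, which
    mirrors the "updates the classification data" behaviour described above.
    """
    final_is_rtt = classified_as_rtt if user_confirms else not classified_as_rtt
    route = Route.AUTOMATIC_APPROVAL if final_is_rtt else Route.MANUAL_APPROVAL
    return final_is_rtt, route
```

In this sketch a rejected RTT classification is routed to manual approval (the ninth step S209), while a rejected non-RTT classification is routed to automatic approval.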
- the reinforcement learning function 113 generates threshold update data indicative of the feedback data provided by the user. This threshold update data is communicated to the probability data and threshold data calculation unit 109.
- the probability data and threshold data calculation unit 109 identifies the threshold data associated with the buyer-vendor pair and updates it in accordance with the threshold update data, then stores the updated threshold data in the probability data and threshold data storage unit 110 for use in classifying future received data items.
- the threshold update data generated by the reinforcement learning function 113 depends on the feedback data provided by the user. For example, if the feedback data indicates that the RTT classification algorithm has correctly classified the invoice, then the reinforcement learning function may send no threshold update data and instead apply a positive weight to the thresholds associated with the relevant vendor-buyer pair to make it less likely that those thresholds will be adjusted.
- the reinforcement learning function 113 may send threshold update data to the probability data and threshold data calculation unit 109 to increase the threshold probability by a predetermined amount, to reduce the likelihood of future false positives.
- the reinforcement learning function 113 may send threshold update data to the probability data and threshold data calculation unit 109 to decrease the threshold probabilities by a predetermined amount, to reduce the likelihood of future false negatives.
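A minimal sketch of this feedback-driven adjustment follows. The function name `update_threshold` and the default step of 0.05 are assumptions made for illustration; the text only says that the threshold is raised or lowered by "a predetermined amount". The positive weighting applied on a correct classification is not modelled here.

```python
def update_threshold(threshold: float,
                     classified_as_rtt: bool,
                     user_confirms: bool,
                     step: float = 0.05) -> float:
    """Return the adjusted threshold for one feature of a buyer-vendor pair.

    `step` stands in for the predetermined amount (hypothetical value).
    """
    if user_confirms:
        # Correct classification: no threshold update data is sent.
        return threshold
    if classified_as_rtt:
        # False positive: raise the threshold so fewer items pass in future.
        return threshold + step
    # False negative: lower the threshold, but never below zero.
    return max(0.0, threshold - step)
```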
- Figure 3 provides a diagram depicting an example implementation of the RTT classification algorithm described with reference to the sixth step S206.
- one of the identified feature values identified from the received invoice by the invoice processing unit 111 is loaded along with the corresponding probability distribution and the probability threshold associated with the invoice feature to which that identified feature value relates.
- the invoice is classified as an invoice that does not relate to a recurrently occurring and trusted transaction. This is because, in this example, for an invoice data item to be classified as being associated with a recurrently occurring and trusted transaction, it is necessary for the probability of each feature value occurring to exceed the relevant predetermined threshold probability.
- an invoice may be classified as being associated with a recurrently occurring and trusted transaction if the probability of only a predetermined proportion (rather than all) of the identified feature values occurring exceeds the respective probability thresholds. For example, if the feature values for three predetermined invoice features are identified from an invoice, it may be sufficient for the probabilities of two of those invoice feature values to exceed their probability thresholds for the invoice to be classified as being associated with a recurrently occurring and trusted transaction.
- the algorithm returns to the first step S301 and loads the next identified feature value, corresponding probability distribution and corresponding probability threshold.
- At a fourth step S304, the invoice is classified as being associated with a recurrently occurring and trusted transaction.
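The loop of Figure 3, together with the "predetermined proportion" variant, can be sketched as below. This is an illustrative sketch, not the claimed implementation: the `Feature` tuple layout, the function name `classify_rtt` and the `min_passing` parameter are all assumptions. Each feature carries its identified value, a callable returning the probability (density) of a value, and its threshold.

```python
from typing import Callable, Optional, Sequence, Tuple

# (identified value, probability distribution as a callable, probability threshold)
Feature = Tuple[float, Callable[[float], float], float]


def classify_rtt(features: Sequence[Feature],
                 min_passing: Optional[int] = None) -> bool:
    """Return True if the item is classified as an RTT.

    By default every feature must exceed its threshold; `min_passing`
    (hypothetical) relaxes this to a fixed number of features.
    """
    if not features:
        return False
    required = len(features) if min_passing is None else min_passing
    passed = sum(1 for value, density, threshold in features
                 if density(value) > threshold)
    return passed >= required
```

With three features, `classify_rtt(features)` requires all three to pass, whereas `classify_rtt(features, min_passing=2)` corresponds to the two-out-of-three example in the text.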
- the probability data and threshold data calculation unit 109 is configured to apply a probability distribution algorithm to the data sets of invoice feature values it receives from the invoice feature analysis function 108.
- the probability distribution algorithm generates for each invoice feature a probability distribution indicative of a likelihood that the predefined invoice feature will take a given value.
- the plurality of probability distributions comprises a plurality of probability density functions.
- the probability distribution algorithm applies a KDE function to the data sets of invoice feature values to produce for each invoice feature a continuous probability distribution, in other words a probability density function. This is advantageous because sharp discontinuities in the input data are smoothed, allowing for more accurate representation and analysis of the underlying data structure. Further, as described in more detail below, thresholding can be done in terms of regions of highest probability density, which is computationally efficient. Moreover, the variables of the kernel used in the KDE function can be optimised for each different invoice feature.
- an optimal kernel shape and optimal bandwidth which balances the smoothness of the distribution produced with sensitivity to the variance in the underlying data set.
- Use of a KDE function means that this bandwidth can be selected for each different invoice feature.
- an optimal kernel shape has been found to be a Gaussian kernel with a bandwidth of “5” where:
- $\hat{f}(x)$ is the computed KDE for each point $x$, with a contribution from each point $x_i$, and $K$ is the kernel used.
- the bandwidth of the kernel is h.
- the kernel has the form
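For reference, a kernel density estimate with bandwidth h and a Gaussian kernel is conventionally written as follows. This is the standard textbook form, given on the assumption that it matches the symbols defined above; n (the number of feature values in the data set) and u (the kernel argument) are introduced here for illustration.

```latex
\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2}\right)
```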
- Figure 4 provides a flow diagram depicting a probability distribution algorithm in accordance with certain examples of the invention.
- a data set of feature values is received for a predefined data item feature which have been extracted by the invoice feature analysis function 108 as described above.
- this data set of feature values may comprise a data set indicating all the values identified for the “invoice-amount” feature identified from a corpus of invoices relating to a particular vendor-buyer pair.
- a KDE function is applied to this data set.
- the KDE (Kernel Density Estimation) function works by placing a kernel (a smooth, continuous function such as a Gaussian) at each data point in the dataset. The kernels are then summed, and the resulting function is normalised to produce a continuous probability distribution that represents the underlying structure of the data. This continuous distribution estimates the probability density of the data.
- the continuous probability distribution that is output by the KDE function is stored in the probability data and threshold data storage unit 110.
- the probability distribution algorithm depicted in Figure 4 is repeated for each data set of extracted invoice feature values producing a probability distribution associated with each predefined invoice feature per vendor-buyer pair, for example: a probability distribution of invoice-amount values; a probability distribution of days-since-last-invoice issued values and a probability distribution for each identified combination of temporal features.
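The place-a-kernel, sum and normalise procedure can be sketched with the standard library alone. The function name `gaussian_kde` and the sample invoice amounts are made up for illustration; the bandwidth of 5 follows the value mentioned in the text.

```python
import math
from typing import Callable, Sequence


def gaussian_kde(samples: Sequence[float],
                 bandwidth: float) -> Callable[[float], float]:
    """Return a probability density function estimated from `samples`."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    n = len(samples)
    # Normalisation so that the summed kernels integrate to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        # One Gaussian kernel centred on every data point, then summed.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in samples)

    return density


# Hypothetical "invoice-amount" values for one vendor-buyer pair.
invoice_amounts = [50.0, 52.0, 49.5, 100.0, 101.0, 99.0]
amount_density = gaussian_kde(invoice_amounts, bandwidth=5.0)
```

One such density function would be produced per predefined invoice feature and per vendor-buyer pair, then stored for later thresholding.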
- the probability data and threshold data calculation unit 109 then applies a threshold setting algorithm to each probability distribution to generate, for each probability distribution, a probability threshold.
- this threshold is set as a function of the entropy of the data set of invoice feature values. This ensures that the threshold is sensitive to the underlying variability of the data set of invoice feature values, in particular accounting for the degree of uniformity in the data distribution. This is important because the more uniform the data set (i.e., the smaller the degree to which the values for the invoice feature vary from invoice to invoice), the less information is contained within the variation of the feature values, and the less likely that any of them are RTTs.
- the algorithm is able to adjust the threshold according to the degree of uniformity in the data set.
- the threshold will be set closer to the maximum probability density, ensuring that only the most significant variations in the feature values are considered for further analysis.
- the threshold will be set lower, allowing for a broader range of values to be included in the analysis. This adaptive approach helps to reduce the likelihood of false positives and false negatives during invoice classification.
- Figure 5 provides a flow diagram depicting a threshold setting algorithm based on the entropy ratio of the data distribution in accordance with certain examples of the invention.
- At a first step S501, the data set and its corresponding probability distribution, generated using the KDE function, are received.
- the entropy (Hd) of the data distribution of the data set is calculated.
- the entropy (Hu) of a uniform distribution with the same number of points as the underlying data distribution of the data set is calculated.
- the highest probability density (pmax) and the range of probability density values (Prange) are determined, the range being measured as the difference between the maximum and minimum probability density values from the received probability distribution.
- the entropy ratio component (1 - Hr(p)) is computed for the probability distribution of the data set, which indicates how non-uniform the distribution is.
- the computed threshold tf is indicative of a region of highest probability density of the probability distribution which is described further with reference to Figure 6A.
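The quantities named in the Figure 5 steps can be computed as in the sketch below. Several points are assumptions rather than statements from the text: the entropy Hd is taken over the density values at the data points after normalising them to sum to 1, Hr is taken to be the ratio Hd/Hu, and the final combination in `illustrative_threshold` is purely hypothetical, because the formula for tf is not reproduced above. All function and class names are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EntropyStats:
    h_d: float             # entropy of the data distribution (Hd)
    h_u: float             # entropy of a uniform distribution, same point count (Hu)
    non_uniformity: float  # entropy ratio component, 1 - Hr, with Hr = Hd / Hu (assumed)
    p_max: float           # highest probability density
    p_range: float         # max density minus min density


def entropy_stats(densities: Sequence[float]) -> EntropyStats:
    """Compute the Figure 5 quantities from density values at the data points."""
    n = len(densities)
    if n < 2 or min(densities) < 0 or sum(densities) <= 0:
        raise ValueError("need at least two non-negative densities with positive sum")
    total = sum(densities)
    probs = [d / total for d in densities]
    h_d = -sum(p * math.log(p) for p in probs if p > 0)
    h_u = math.log(n)
    p_max = max(densities)
    return EntropyStats(h_d, h_u, 1.0 - h_d / h_u, p_max, p_max - min(densities))


def illustrative_threshold(stats: EntropyStats) -> float:
    """HYPOTHETICAL combination: p_max - (1 - Hr) * p_range. Not from the source."""
    return stats.p_max - stats.non_uniformity * stats.p_range
```

Under this assumed combination, a perfectly uniform set of densities gives a threshold equal to the maximum density, and increasingly peaked data lowers it towards the minimum density.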
- Figure 6A provides a diagram depicting an example probability distribution, which is generated using the KDE function as described in the Figure 4 algorithm, and its corresponding threshold (tf). As will be understood, this probability distribution corresponds to a probability density function associated with the likelihood of an invoice data item feature taking a given value.
- the x-axis represents the data values
- the y-axis represents the probability density values.
- the probability distribution is illustrated as a continuous Gaussian-like curve, resulting from the KDE function applied in the probability distribution algorithm described with reference to Figure 4.
- the threshold (tf) is calculated using the entropy-based algorithm outlined in Figure 5.
- the threshold value is visually represented as a horizontal line on the probability distribution.
- the threshold tf defines a region R of greatest probability density which is denoted by the shaded area under the curve.
- Data item feature values on the x-axis within this region represent the values with the highest probability densities and thus those that are most likely to occur.
- At the second step S302, it is determined, using the loaded probability distribution and probability threshold value, whether a probability of the identified value of the data item feature occurring exceeds the probability threshold value.
- An example of how this can be performed can be understood with reference to Figure 6A.
- If a data item feature value has a value along the x-axis that falls within the region R, it is within the region of highest probability density defined by the threshold tf. Therefore, to determine whether a probability of the identified value of the data item feature occurring exceeds the probability threshold value, it is simply a matter of determining whether or not the data item feature value is within this region.
- Using the region of highest probability density set by the threshold offers a computationally efficient way of determining whether the probability threshold is exceeded. Only a single operation need be performed, i.e. comparing the identified value's probability against the threshold, rather than alternative techniques that might involve a sequence of more complex operations, for example calculating individual probabilities for all possible values, conducting multiple hypothesis tests, or iteratively updating Bayesian models.
- the example probability distribution shown in Figure 6A depicts a simple probability density function with a single peak.
- the probability density function may be multi-modal with two or more peaks some of which may have regions of highest probability density.
- the probability density function defining the probability distribution may have multiple regions of highest probability density within which the probability of a value occurring may fall.
- Such a probability distribution might occur, for example, for the invoice data item feature of invoice-amount where recurrently occurring and trusted transactions typically take one of two or more distinct values, for example either $50 or $100, or some slightly perturbed values, for example $52 or $101.
- An example of a probability distribution comprising such a multi-modal probability density function is shown in Figure 6B. As can be seen from Figure 6B, if an invoice-amount value falls within a first region of highest density R1 centred around the value $50, or within a second region of highest density R2 centred around the value $100, it is within a region of highest probability density defined by the threshold tf.
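One way to locate the regions R1 and R2 for a multi-modal density is to scan a grid of x values and record the contiguous intervals where the density exceeds tf. The sketch below assumes this grid-scan approach, which is not described in the text; the names `high_density_regions` and `in_any_region`, the bimodal test density and the threshold of 0.02 are all made up for illustration.

```python
import math
from typing import Callable, List, Tuple


def high_density_regions(density: Callable[[float], float], threshold: float,
                         lo: float, hi: float,
                         steps: int = 1500) -> List[Tuple[float, float]]:
    """Return contiguous [start, end] intervals where density(x) > threshold."""
    regions: List[Tuple[float, float]] = []
    start = None
    width = (hi - lo) / steps
    for i in range(steps + 1):
        x = lo + i * width
        inside = density(x) > threshold
        if inside and start is None:
            start = x                           # entering a region
        elif not inside and start is not None:
            regions.append((start, x - width))  # leaving a region
            start = None
    if start is not None:
        regions.append((start, hi))             # region runs to the grid edge
    return regions


def in_any_region(value: float, regions: List[Tuple[float, float]]) -> bool:
    """Single membership test used at classification time."""
    return any(a <= value <= b for a, b in regions)


def bimodal(x: float) -> float:
    """Hypothetical density with peaks at 50 and 100 (standard deviation 5)."""
    g = lambda m: math.exp(-0.5 * ((x - m) / 5.0) ** 2) / (5.0 * math.sqrt(2 * math.pi))
    return 0.5 * (g(50.0) + g(100.0))


regions = high_density_regions(bimodal, threshold=0.02, lo=0.0, hi=150.0)
```

Precomputing the intervals once per distribution keeps the per-invoice check to a simple range comparison, in line with the efficiency argument made for the region-based test.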
- the probability distributions are generated in the form of continuous probability distributions, specifically probability density functions.
- some or all of the probability distributions can be calculated using different techniques.
- the probability distributions can be represented as histograms or normalised histograms with a plurality of bins, each representing a discrete value that the invoice data item feature can take. The height of each bin corresponds to the frequency of occurrence of that particular invoice data item feature within the dataset of invoice feature values identified from the corpus of invoice data items.
- the probability threshold may be set by considering the frequency of occurrence for each discrete value in the distribution. This can be done by selecting a suitable frequency threshold, such that only those discrete values with a frequency higher than the threshold are considered likely or probable. In keeping with the example described above, this probability threshold can be set based on the entropy of the underlying dataset.
- the probability thresholds for determining likely or probable invoice feature values are set using a predetermined static factor, rather than employing entropy-based techniques.
- the static factor is applied differently based on the type of probability distribution associated with the invoice feature values.
- the probability threshold is determined by multiplying the predetermined static factor by the maximum probability density.
- the probability threshold is established by multiplying the predetermined static factor by the maximum probability value.
- a predetermined static factor of 0.9 is selected for setting the probability thresholds. This approach allows for a consistent and straightforward method of determining probability thresholds, which can be easily adjusted by altering the static factor to accommodate various degrees of confidence or specificity in the analysis of invoice features.
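The histogram representation and the static-factor threshold can be illustrated together. This is a sketch under assumptions: the function names and the sample amounts are invented, and a normalised histogram is used so that bin heights are probabilities.

```python
from collections import Counter
from typing import Dict, Sequence


def histogram_probabilities(values: Sequence[float]) -> Dict[float, float]:
    """Normalised histogram: relative frequency of each discrete value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def static_threshold(peak: float, factor: float = 0.9) -> float:
    """Threshold = static factor x maximum probability (or maximum density)."""
    return factor * peak


# Hypothetical invoice amounts for one vendor-buyer pair.
amounts = [50, 50, 50, 50, 100, 100, 100, 75]
probs = histogram_probabilities(amounts)
threshold = static_threshold(max(probs.values()))
likely = sorted(v for v, p in probs.items() if p > threshold)
```

With these sample values only the most frequent amount clears the 0.9 factor; a smaller factor would admit the second peak as well, which is the kind of adjustment the text describes.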
- the classification decision module 112 is configured to perform a confidence score algorithm if the RTT classification algorithm determines that the invoice relates to a recurrently occurring and trusted transaction. This algorithm produces a confidence score output which is communicated to the user device 117 along with the corresponding RTT classification data (indicating whether the invoice has been classed as being associated with a recurrently occurring and trusted transaction). The confidence score is indicative of a degree of certainty associated with the classification.
- Software running on the user device 117 is configured to display the confidence score to the user when they are confirming or rejecting the RTT classification performed by the RTT classification algorithm.
- Providing a confidence score in this way improves decision-making by offering users additional information about the classification and helps prioritise manual review by indicating which invoices may need closer scrutiny based on the confidence score. Additionally, the provision of a confidence score may enhance user trust in the system by making the classification process more transparent.
- Figure 7 provides a flow diagram depicting a confidence score calculation algorithm.
- the probability values for each identified invoice feature value of the received invoice data item are calculated using the relevant probability distribution. This information can be obtained during the second step S302 of the invoice classification algorithm.
- the probability values obtained in step S701 are multiplied together to compute a product of probabilities.
- the number of invoices between the vendor and buyer is identified, and a scaling factor is calculated based on this number. In other words, this number is indicative of the size of the corpus of previously received data items which were used by the probability data and threshold data calculation unit 109 to generate the probability distributions for each predefined invoice feature.
- the scaling factor is a function that depends on the number of invoices between the vendor and buyer. The goal is to increase the impact of the variable as the number of invoices increases. If there are fewer invoices, the effect is down-weighted, and vice versa.
- the value of the scaling function approaches 1 asymptotically as the number of invoices increases, and the rate at which it approaches 1 is controlled by a rate parameter 'a'. In this way, the use of larger corpora of previously received invoice data items to generate the probability distributions leads to a higher scaling factor, reflecting increased confidence in the classification.
- At step S704, the scaling factor determined in step S703 is applied to the product of probabilities from step S702, resulting in the confidence score.
- At step S705, the calculated confidence score is communicated to the user device, along with the corresponding RTT classification data.
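Steps S701 to S704 can be sketched as follows. The exact scaling function is not given in the text, so the form 1 - exp(-a * n) is an assumption chosen only because it approaches 1 asymptotically at a rate set by the parameter a; the default a = 0.1 and the function names are likewise hypothetical.

```python
import math
from typing import Sequence


def scaling_factor(n_invoices: int, a: float = 0.1) -> float:
    """Assumed scaling function: tends to 1 as n_invoices grows (rate a)."""
    return 1.0 - math.exp(-a * n_invoices)


def confidence_score(probabilities: Sequence[float],
                     n_invoices: int, a: float = 0.1) -> float:
    """Product of per-feature probabilities (S702) scaled by corpus size (S703, S704)."""
    product = math.prod(probabilities)
    return scaling_factor(n_invoices, a) * product
```

Under this assumption, the same per-feature probabilities yield a lower score when only a handful of invoices exist between the vendor and buyer than when the corpus is large.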
- Figure 8 provides a diagram depicting an example of the scaling function that depends on the number of invoices (denoted as 'x') between the vendor and buyer.
- the graph demonstrates how the value of the scaling function approaches 1 asymptotically as the number of invoices increases.
- On the x-axis, the number of invoices between the vendor and buyer is plotted, while the y-axis represents the value of the scaling function.
- the curve in the graph illustrates how the effect of the variable is down-weighted for fewer invoices and increases as the number of invoices grows.
- Figure 9 provides a simplified schematic diagram depicting an example implementation of a system of the type described with reference to Figure 1.
- Figure 9 illustrates an accounting services system 901 that runs software and data management systems, offering a variety of accounting services 902. These services may include, for example, payroll management, tax preparation and filing, financial reporting, bookkeeping, and cash flow analysis, among others.
- the accounting services system 901 also operates software and data management systems that provide an invoice processing service 903.
- This service comprises a historical invoice analysis module 904, an invoice classification module 905, and a payment processing module 906, which operate in accordance with the historical invoice analysis module 102, invoice classification module 103, and payment processing module 104 described above.
- the accounting services system 901 is configured to provide the accounting services 902 and the invoice processing services 903 to a plurality of users 907 connected to the accounting services system 901 via a data network 908, typically provided by the Internet.
- the accounting services 902 and the invoice processing service 903 can be delivered to the users 907 in a variety of suitable ways.
- each user 907 operates their own web browsing software, and the services provided by the accounting services system 901 are made available through a web-based interface or platform. Users can access the services by navigating to a dedicated website or web application, where they can securely log in with their credentials and interact with the system to perform various accounting and invoice processing tasks.
- the accounting services system 901 could offer these services through custom desktop or mobile applications, which users can download and install on their devices.
- the historical invoice analysis module 904, invoice classification module 905, and payment processing module 906 can be implemented using a variety of suitable computing techniques and methodologies known in the art.
- the implementation of these modules may involve hardware, software, or a combination of both, and can utilise various programming languages, algorithms, data structures, and processing approaches, as appropriate for the specific requirements of each module.
- modules can be deployed on a single computing device or distributed across multiple devices using cloud computing or other distributed computing techniques. This allows for flexibility in scaling the system according to demand and performance requirements. Furthermore, the modules can be designed to operate independently as separate modules, or they can be integrated and run together as a unified system to streamline the overall process.
- the historical invoice analysis module 904, invoice classification module 905, and payment processing module 906 can also be implemented in various ways, depending on the system architecture and user requirements. As illustrated in Figure 9, these modules can be implemented remotely on the accounting services system 901, allowing users to access and utilise the services through a web-based interface or a dedicated application, facilitating centralised management, data security, and ease of updates.
- modules can be implemented directly on the user devices, enabling offline functionality and greater control over data and system customisation.
- the data storage provided by the historical invoice data database 107 and the probability data and threshold data storage unit 110 can be implemented in any suitable way as is known in the art.
- these storage systems may be configured as individual databases or as multiple distributed databases, employing various database technologies such as relational databases, NoSQL databases, or object-oriented databases.
- invoice features discussed include invoice amount, number of days since last invoice, and a combination of temporal values.
- any suitable invoice feature, derivable directly or indirectly from an invoice can be used.
- other potential invoice features may encompass invoice due date, invoice currency, tax rates applied, line-item descriptions, quantities and prices, vendor identification or categorisation, purchase order numbers, and shipping or delivery information, among others.
- an invoice data item is classified as being associated with a recurrently occurring and trusted transaction if it includes three independent values, where the probability of each of those values occurring exceeds the relevant predetermined threshold probability.
- the number of values may be different in different embodiments.
- an invoice data item is classified as being associated with a recurrently occurring and trusted transaction if it includes four, five, ten or any suitable integer of independent values, where the probability of each of those values occurring exceeds the relevant predetermined threshold probability.
- the classification data generated by the invoice classification module 103 is input to a payment processing module 104 to selectively initiate an automatic approval module 114 or manual approval module 115.
- the output of the classification module 103 can be used for additional or alternative processes within, for example, an accounting services system.
- the classification data can be input to a fraud detection module designed to identify potentially fraudulent activity or inconsistencies in billing practices. By analysing the classification data, this module can identify invoices that deviate from the expected patterns of recurrently occurring and trusted transactions for further investigation.
- the classification data could be input to an invoice categorisation module, which is responsible for organising and maintaining records of transactions based on their classification. This module can use the classification data to efficiently sort and track invoices, simplifying record-keeping and retrieval processes.
- the classification data can be used by a reconciliation module that automates the process of matching invoices with corresponding purchase orders or other related documents. By incorporating the classification data, this module can prioritise reconciliation tasks for invoices identified as recurrently occurring and trusted transactions, streamlining the overall accounting process.
- a transaction is classified as being associated with a recurrently occurring and trusted transaction by analysing an invoice data item.
- techniques in accordance with the invention can classify if other types of data items relating to transactions are associated with recurrently occurring and trusted transactions. For example, these may include electronic purchase orders, sales receipts, credit notes, debit notes, bank statements, financial statements, tax documents, payment confirmations, electronic funds transfer records, cheques, and so on.
- the automatic approval module 114 is activated to initiate a first type of transaction processing process based on the content of the invoice data item, specifically the automatic approval of the payment for the transaction to which the received invoice data item relates.
- the classification decision module 112 determines that a received invoice data item relates to a recurrently occurring and trusted transaction, and this is confirmed by a user
- the manual approval module 115 is activated to initiate a second type of transaction processing process based on the content of the invoice data item, specifically the manual approval of the payment for the transaction to which the received invoice data item relates.
- the steps associated with obtaining confirmation from a user that the RTT classification data generated by the classification decision module 112 is correct can be omitted.
- a first type of transaction processing process, for example activation of the automatic approval module 114, is initiated solely on the basis of the classification decision module 112 classifying a received data item as being associated with a recurrently occurring and trusted transaction.
- a second type of transaction processing process, for example activation of the manual approval module 115, is initiated solely on the basis of the classification decision module 112 classifying a received data item as not being associated with a recurrently occurring and trusted transaction.
- processing an invoice can comprise relevant data being added to a general ledger and accounts payable database.
- full processing of the transaction associated with the invoice typically involves executing payment of the invoice itself.
- adding relevant data to the general ledger and accounts payable database may comprise only partial automation of a transaction process associated with an invoice if actual payment of the invoice must be manually instigated or otherwise overseen.
- the first transaction processing process (initiated when the classification decision module 112 determines that a received invoice data item relates to a recurrently occurring and trusted transaction) can comprise a fully automated transaction process in which, for example, an invoice is processed and payment initiated automatically.
- the first type of transaction processing process initiated when the classification data indicates that the received data item is associated with a recurrently occurring and trusted transaction, comprises an automated payment process.
- a payment associated with the received invoice data item (or any other suitable data item for authorising a payment) is automatically initiated, for example by interacting with a banking system.
- Figure 10 provides a simplified schematic diagram depicting an example implementation of such a scheme.
- Figure 10 depicts an automatic approval module 114 which otherwise operates as described with reference to Figure 1 and is disposed in relation to the invoice classification module 103 and other components of the invoice classification system 101 as shown in Figure 1 (these components are omitted from Figure 10 for clarity). However, in this example, the automatic approval module 114 is additionally connected via a suitable data connection to an API 1001 for communicating data to and from a banking computer system 1002.
- the automatic approval module 114 In use, in the event that the automatic approval module 114 receives an RTT classification from the classification decision module 112 indicating a received invoice data item is associated with a recurrently occurring and trusted transaction, the automatic approval module 114 is configured to generate payment authorisation data from the relevant payment data extracted from the invoice data item, including, for example, the payment amount, currency, recipient’s bank account details, and transaction ID. This data is then communicated to the API 1001 along with any necessary security-related data, such as encrypted API keys and session authentication tokens. The API 1001 then performs relevant security functions, such as validating and encrypting the incoming data and communicates appropriately formatted payment instructions, along with any necessary operational commands, to the banking computer system 1002. The banking computer system 1002 then verifies the authenticity of the transaction, debits the payer’s account, credits the recipient’s account, and confirms the transaction completion back to the API.
Landscapes
- Business, Economics & Management (AREA)
- Engineering & Computer Science (AREA)
- Accounting & Taxation (AREA)
- Computer Security & Cryptography (AREA)
- Finance (AREA)
- Strategic Management (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A computer implemented method for classifying received data items as being associated with recurrently occurring and trusted transactions. The method includes receiving a data item from a source and extracting a plurality of feature values from the data item. The method retrieves from storage a probability data set and a threshold data set associated with a source-recipient pair. For each extracted feature value, the method determines if the probability that the data item feature takes the extracted feature value exceeds the probability threshold using the corresponding probability distribution and probability threshold. If the probability exceeds the threshold for at least a predetermined proportion of the extracted feature values, the method generates classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
Description
DATA ITEM CLASSIFICATION
Technical Field
The present invention relates to techniques for classifying whether data items, such as electronic invoices, are associated with a particular type of transaction, specifically recurrent and trusted transactions.
Background
Substantial data processing and data storage efficiencies can be achieved in accounting software systems by identifying whether data items such as invoices are associated with "recurrently occurring and trusted transactions”. These are frequent and regular transactions between vendors and buyers that have a low probability of being irregular or unauthorised and are therefore “trusted”. Identifying invoices, or similar data items, related to these transactions within accounting software systems can greatly improve the efficiency of invoice processing, as invoice processing and payment can be expedited with reduced checking, or increased or complete automation, and bypassing more complex processing that is applied to other invoices.
However, current methods for identifying invoices associated with recurrently occurring and trusted transactions, as implemented in accounting software systems, have their limitations. Conventional techniques often employ rule-based solutions, which require the manual creation and maintenance of rules for each vendor. As the number of vendors increases, this approach becomes increasingly cumbersome and difficult to scale, leading to higher administrative costs and the potential for errors in rule application. Furthermore, rule-based solutions lack the flexibility to adapt to changes in invoice patterns, potentially causing inaccuracies in identifying trusted transactions within the accounting software systems.
Machine learning solutions using time-series forecasting have also been explored as a means of identifying recurrently occurring and trusted transactions within accounting software systems. While these solutions can be more adaptive, they may yield less reliable predictions when invoice data deviates even slightly from historical data. This sensitivity to small data changes can result in false positives or negatives, leading to inefficiencies in the invoice processing workflow. Additionally, these techniques typically necessitate manual threshold setting to some extent, which can further slow the process and introduce potential errors.
Therefore, there is a need for a more efficient and accurate approach to identifying data items, such as invoices, related to recurrently occurring and trusted transactions, which can be easily scaled and adapted to accommodate the dynamic nature of business relationships and invoice data. This would enable faster and more reliable invoice processing within accounting software systems.
Summary of the Invention
In accordance with a first aspect of the invention, there is provided a computer implemented method for classifying received data items as being associated with recurrently occurring and trusted transactions. The method comprises the steps of: receiving a data item from a source; extracting a plurality of feature values from the data item, each feature value associated with one of a plurality of predefined data item features; retrieving from storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, wherein the probability data set includes a plurality of probability distributions, each probability distribution representing the likelihood that one of the plurality of predefined data item features will take a given value, each probability distribution based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient, and the threshold data set comprises a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions; and for each extracted feature value: determining, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted feature value exceeds the probability threshold, and if, for at least a predetermined proportion of the extracted feature values, the probability that the data item feature takes the extracted feature value exceeds the probability threshold, generating classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
Optionally, the method further comprises the steps of: transmitting the classification data to a user device; and receiving feedback data comprising a validation or a rejection of the classification from a user via the user device, wherein the classification data is updated based on the feedback data if a rejection is received from the user via the user device.
Optionally, upon receiving a validation or rejection of the classification in the feedback data from the user via the user device, the method further comprises the steps of: updating the probability thresholds based on the feedback data; and storing the updated probability thresholds in the storage for use in classifying future received data items.
Optionally, the plurality of probability distributions comprises a plurality of probability density functions.
Optionally, the probability threshold for each probability distribution is determined based on the entropy of the data set of data item feature values to which the probability distribution relates.
Optionally, each probability density function is derived by applying a kernel density estimation function to the data set of data item feature values to which the probability density function relates.
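By way of illustration only, kernel density estimation over a data set of feature values can be sketched as follows. This is a minimal sketch and not necessarily the estimator used in any embodiment: the Gaussian kernel, the function name `gaussian_kde_pdf`, the fixed `bandwidth` parameter and the sample invoice amounts are all assumptions introduced for the example.

```python
import math


def gaussian_kde_pdf(samples, bandwidth):
    """Return a function estimating the probability density at a value x.

    Assumption: a Gaussian kernel with a fixed, caller-supplied bandwidth.
    """
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        # Sum one Gaussian bump per historical feature value.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf


# Hypothetical invoice-amount values for one source-recipient pair.
amounts = [100.0, 101.5, 99.0, 100.0, 250.0]
pdf = gaussian_kde_pdf(amounts, bandwidth=2.0)
```

In this sketch the estimated density is highest near the cluster of amounts around 100, lower near the isolated value of 250, and close to zero between them.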
Optionally, each probability threshold defines at least one region of highest probability density of the respective probability density function, and for each extracted feature value the step of determining, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted feature value exceeds the probability threshold comprises: determining if the extracted feature value is associated with a probability within a region of highest probability density of the respective probability distribution function defined by the probability threshold.
Optionally, the method further comprises the steps of, for each extracted feature value: calculating a probability value, using the corresponding probability distribution, indicative of a probability that the associated data item feature will take that value; and generating a score for the received data item when it is classified to be trusted using the calculated probability values.
Optionally, the step of generating a score for the received data item comprises summing the calculated probability values multiplied by a scaling factor, wherein the scaling factor is based on a size of the corpus of previously received data items from the source sent to the recipient, wherein larger corpora lead to a higher scaling factor, reflecting increased confidence in the classification.
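The scoring step can be illustrated with the following sketch. The text above states only that the scaling factor is based on corpus size and increases for larger corpora; the particular saturating form `n / (n + half_point)`, the `half_point` parameter and the function names are assumptions made purely for the example.

```python
def scaling_factor(corpus_size, half_point=10):
    """Hypothetical scaling factor in [0, 1) that grows with corpus size.

    Assumption: the factor reaches 0.5 when corpus_size equals half_point.
    """
    if corpus_size < 0 or half_point <= 0:
        raise ValueError("corpus_size must be >= 0 and half_point > 0")
    return corpus_size / (corpus_size + half_point)


def confidence_score(probabilities, corpus_size):
    """Sum the per-feature probability values, each multiplied by the factor."""
    factor = scaling_factor(corpus_size)
    return sum(p * factor for p in probabilities)
```

Under these assumptions, the same probability values yield a higher score when they are derived from a larger corpus of previously received data items.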
Optionally, the data item received from the source is an invoice data item; the plurality of predefined data item features are predefined invoice features, and each probability distribution is based on a data set of invoice data item feature values from a corpus of previously received invoice data items from the source sent to the recipient.
Optionally, if, for all of the extracted feature values, the probability that the data item feature takes the extracted feature value exceeds the probability threshold, the method comprises generating classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
Optionally, the method further comprises initiating a first transaction processing process based on content of the received data item if the classification data is indicative of the received data item being associated with a recurrently occurring and trusted transaction.
Optionally, the method further comprises initiating a second transaction processing process based on the content of the received data item if the classification data is indicative of the received data item not being associated with a recurrently occurring and trusted transaction.
Optionally, the first transaction processing process is an automated or partially automated transaction process.
In accordance with a second aspect of the invention there is provided a system for classifying received data items as being associated with recurrently occurring and trusted transactions. The system comprises storage on which is stored a plurality of probability data sets and threshold data sets, each probability data set and threshold data set associated with a source-recipient pair, each probability data set comprising a plurality of probability distributions, each probability distribution representing the likelihood that one of a plurality of predefined data item features will take a given value based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient with which the probability data set is associated, and each threshold data set comprising a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions. The system further comprises a data item classification module comprising a data item processing module and a classification decision module, wherein said data item processing module is configured to: receive a data item from a source; extract a plurality of feature values from the data item, each value associated with one of a plurality of predefined data item features; retrieve from the storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, and the classification decision module is configured, for each extracted feature value, to: determine, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted value exceeds the probability threshold, and if, for at least a predetermined proportion of the extracted feature values, the probability that the data item feature takes the extracted value exceeds the probability threshold, the classification decision module is configured to generate classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
In accordance with a third aspect of the invention there is provided a data item classification module for use in a system according to the second aspect, comprising a data item processing module and a classification decision module, wherein said data item processing module is configured to: receive a data item from a source; extract a plurality of feature values from the data item, each feature value associated with one of a plurality of predefined data item features; retrieve from storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, wherein the probability data set includes a plurality of probability distributions, each probability distribution representing the likelihood that one of the plurality of predefined data item features will take a given value, each probability distribution based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient, and the threshold data set comprises a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions, and the classification decision module is configured to: for each extracted feature value: determine, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted value exceeds the probability threshold, and if, for at least a predetermined proportion of the extracted feature values, the probability that the data item feature takes the extracted feature value exceeds the probability threshold, generate classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
In accordance with a fourth aspect of the invention there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of the first aspect.
In accordance with a fifth aspect of the invention there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of the first aspect.
In accordance with embodiments of the invention, a technique of classifying whether a received document is associated with a recurrently occurring and trusted transaction is provided.
Instead of relying on rule-based solutions or time-series forecasting, the invention uses probability distributions and probability thresholds to determine whether a data item, such as an electronic invoice, is associated with a recurrently occurring and trusted transaction. The method extracts a plurality of feature values from a received data item, each feature value associated with one of a predefined set of data item features. It then retrieves a probability data set and a threshold data set associated with a source-recipient pair, comprising the source and the intended recipient of the document.
The probability data set includes multiple probability distributions, each representing the likelihood that one of the predefined document features will take a given value. These distributions are based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient. The threshold data set comprises a plurality of probability thresholds, each defining a probability relative to one of the probability distributions.
For each extracted feature value, it is determined if the probability that the data item feature takes the extracted feature value exceeds the probability threshold. If this is true for at least a predetermined proportion of the extracted feature values, the method generates classification data indicative of the received document being associated with a recurrently occurring and trusted transaction.
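The decision rule described above can be sketched as follows. This is an illustrative sketch only: the function name `classify_rtt`, the dictionary layout keyed by feature name and the `min_proportion` parameter are assumptions, and the probability values are taken to have been computed already from the relevant probability distributions.

```python
def classify_rtt(probabilities, thresholds, min_proportion=1.0):
    """Return True if enough feature probabilities exceed their thresholds.

    probabilities: dict mapping feature name to the probability of the
        extracted value under that feature's distribution.
    thresholds: dict mapping feature name to its probability threshold.
    min_proportion: the predetermined proportion of features that must pass
        (1.0 means every feature must pass).
    """
    if not probabilities:
        raise ValueError("no feature values were extracted")
    passed = sum(
        1 for feature, p in probabilities.items() if p > thresholds[feature]
    )
    return passed / len(probabilities) >= min_proportion


# Hypothetical values for three invoice features.
probs = {"amount": 0.40, "days_since_last": 0.30, "temporal": 0.01}
limits = {"amount": 0.10, "days_since_last": 0.10, "temporal": 0.05}
```

With these hypothetical values two of the three features exceed their thresholds, so the data item is classified as trusted only if the predetermined proportion is two thirds or lower.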
This classification can then be used to initiate a suitable process, for example an automated or semi-automated invoice payment process.
This invention offers several advantages over previous methods by overcoming the limitations of rule-based solutions and time-series forecasting techniques. The technique can be easily expanded to handle a large number of vendors and transactions without the need for manually creating and maintaining rules, making it a more efficient solution for identifying regular and trustworthy transactions. Additionally, it can adapt to changes in invoice patterns more effectively, as it uses probability distributions and thresholds instead of fixed rules, providing a more flexible approach.
Further, in accordance with embodiments of the invention, the accuracy of classification is improved due to the use of probability distributions and thresholds in determining whether a document is associated with a recurrently occurring and trusted transaction. By basing the classification on the likelihood of specific document features taking certain values, the technique captures the patterns and variations in the invoices more effectively than rule-based solutions or time-series forecasting. This probabilistic approach allows the system to better identify and adapt to invoice patterns and changes, leading to fewer false positives and negatives. As a result, the classification of data items as being associated with recurrently occurring and trusted transactions is more accurate, contributing to a more efficient and reliable processing workflow.
Moreover, the technique automates the classification of data items, reducing the need for manual threshold setting and rule maintenance, ultimately simplifying, for example, an invoice processing workflow.
Various further features and aspects of the invention are defined in the claims.
Brief Description of the Drawings
Embodiments of the present invention will now be described by way of example only with reference to the accompanying drawings where like parts are provided with corresponding reference numerals and in which:
Figure 1 provides a simplified schematic diagram depicting a system in which a technique for classifying invoice data items as being associated with recurrently occurring and trusted transactions is implemented, in accordance with certain embodiments of the invention;
Figure 2 provides a diagram depicting a process undertaken by the system depicted in Figure 1, in accordance with certain embodiments of the invention;
Figure 3 provides a diagram depicting a process for classifying an invoice data item as being associated with a recurrently occurring and trusted transaction, in accordance with certain embodiments of the invention;
Figure 4 provides a diagram depicting a process for generating a probability distribution, in accordance with certain embodiments of the invention;
Figure 5 provides a diagram depicting a process of generating a probability threshold, in accordance with certain embodiments of the invention;
Figures 6A and 6B depict probability distributions of the occurrence of feature values in invoice data items, in accordance with certain embodiments of the invention;
Figure 7 provides a diagram depicting a process for generating a confidence score, in accordance with certain embodiments of the invention;
Figure 8 provides a diagram depicting the variance of a scaling factor used to generate a confidence score, in accordance with certain embodiments of the invention;
Figure 9 provides a simplified schematic diagram depicting an example system in which techniques in accordance with certain embodiments of the invention can be implemented; and
Figure 10 provides a simplified schematic diagram depicting an implementation of a scheme for automatically initiating a payment associated with a received invoice data item.
Detailed Description
Figure 1 provides a simplified schematic diagram of an invoice classification system 101, arranged in accordance with certain embodiments of the invention.
The system is arranged to receive an electronic invoice issued by a vendor and sent to a buyer and then classify whether or not the invoice relates to a recurrently occurring and trusted transaction (RTT). This classification is based on a probabilistic analysis of values that certain features of the invoice take with respect to invoices that have been previously sent from the vendor to the buyer. If it is determined that the invoice relates to a recurrently occurring and trusted transaction, the invoice is forwarded for payment processing to an automated payment process. On the other hand, if it is determined that the invoice does not relate to a recurrently occurring and trusted transaction, the invoice is forwarded to a payment process that involves a manual review.
The system includes a historical invoice analysis module 102, an invoice classification module 103, and a payment processing module 104. These components are typically integrated into a larger accounting services system which provides accounting services to a user.
Figure 1 also illustrates a source computer system 105, which sends an invoice to the invoice classification module 103 through a network 106.
The electronic invoice can take the form of any suitable electronic data item, for example an email, a PDF document, Microsoft Word or Excel files, image files (such as JPEG or PNG), plain text files, or other electronic file formats such as CSV, XML, or JSON.
As the skilled person will appreciate, the invoice can be communicated to the invoice classification module 103 in any suitable way depending, for example, on the way in which the accounting services system provides accounting services to the user device 117. For example, the invoice can be sent directly from the source computer system 105 to the invoice classification module 103, for example, as an attachment to an email. Alternatively, the source computer system 105 may send the invoice as an attachment to an email to the user device 117 which is then forwarded by software running on the user device 117 to the invoice classification module 103. Alternatively, the user device 117 may receive the invoice from the source computer system 105 and upload the invoice via a suitable interface for processing to the invoice classification module 103.
The historical invoice analysis module 102 comprises a historical invoice data database 107, on which is stored invoices previously sent from various vendors to various buyers. The historical invoice analysis module 102 further includes an invoice feature analysis function 108, a probability data and threshold data calculation unit 109, and a probability data and threshold data storage unit 110.
The invoice classification module 103 comprises an incoming invoice processing unit 111 and a classification decision module 112. Additionally, the invoice classification module 103 features a reinforcement learning function 113.
The payment processing module 104 consists of an automatic approval module 114, a manual approval module 115, and an approved payment database 116 (provided specifically by a general ledger and accounts payable database).
Typically, the historical invoice analysis module 102 is substantially an offline module, configured to generate and store probability data and threshold data ahead of the operation of the invoice classification module 103. The invoice classification module 103 is an online module, operating substantially in "real-time."
In use, on a vendor-buyer pair by vendor-buyer pair basis, the historical invoice analysis module 102 analyses previously exchanged invoices to determine the frequency with which certain invoice features take certain values in the previously issued invoices. This information is then used to calculate a probability distribution for each invoice feature. The probability distribution shows the distribution of probabilities that each invoice feature will take a specific value in future invoices exchanged between the particular vendor-buyer pair.
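As a simple illustration of turning observed frequencies into a probability distribution, the following sketch computes relative frequencies for a discrete invoice feature. It is a minimal sketch under stated assumptions: the function name `empirical_distribution` and the sample values are hypothetical, and an embodiment may instead use a continuous distribution.

```python
from collections import Counter


def empirical_distribution(values):
    """Map each observed feature value to its relative frequency."""
    if not values:
        raise ValueError("the corpus contains no values for this feature")
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Hypothetical "days since last invoice" values for one vendor-buyer pair.
days_between = [28, 28, 30, 28, 31, 28, 30, 28]
distribution = empirical_distribution(days_between)
```

Here a 28-day interval, which occurs in five of the eight historical invoices, receives the highest probability.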
A probability threshold is then determined for each probability distribution. This threshold is set so that, with reference to the relevant probability distribution, if the probability of an invoice feature value occurring exceeds the threshold, this is suggestive of the invoice relating to a recurrently occurring and trusted transaction. If this is the case for multiple feature values, this is sufficient to classify the invoice as relating to a recurrently occurring and trusted transaction.
An invoice received by the invoice classification module 103 is then analysed to identify the value that each predetermined invoice feature takes in that invoice, and the probability distributions are then used to determine whether or not the probability of these feature values occurring exceeds the relevant probability thresholds. If a probability criterion is met, typically that the probability thresholds are exceeded for all the identified feature values, then the invoice is classed as a recurrently occurring and trusted transaction invoice and forwarded for automatic processing.
An example of the offline operation of the historical invoice analysis module 102 will now be described in detail. The historical invoice data database 107 has stored thereon data associated with multiple vendor-buyer pairs. This data is arranged into a plurality of different corpora, each corpus comprising a plurality of invoices associated with a particular vendor-buyer pair.
For each corpus, the invoice feature analysis function 108 is configured to process each invoice to locate the predetermined invoice features, and then identify the values that these features take. A data set of these feature values is then generated.
The predefined invoice features are typically selected on the basis that the values that they take are indicative of whether or not a transaction is a recurrently occurring and trusted transaction.
Invoice features that work optimally in this regard include those that tend to have values that are the same or similar across most or all recurring trusted transactions but vary across the broader corpus of invoices that also contains invoices which do not relate to recurrently occurring and trusted transactions. Examples include “invoice amount”, i.e. the numerical amount of money associated with a transaction, because recurring trusted transactions between a given vendor and buyer typically have the same or similar value; and “number of days since last invoice issued”, because invoices associated with recurring trusted transactions will typically be issued at regularly spaced intervals.
Another suitable invoice feature is a “temporal” value combination, in other words a combination of temporal-related values, for example the combination of issue date (e.g. 2), issue day of the week (e.g. Monday), issue month (e.g. January) and issue year (e.g. 2023). Whilst an invoice feature comprising a combination of such temporal values might not necessarily be identical or correlate strongly across recurrently occurring and trusted transactions, such a combination of temporal features may assist in identifying unusual invoices which correlate negatively with being a recurring trusted transaction.
The invoice feature analysis function 108 is configured to communicate a data set indicative of the values of the invoice features identified in the corpus of invoice data items to the probability data and threshold data calculation unit 109. For example, with reference to the invoice features described above, this would include a data set comprising all the extracted invoice-amount values; a data set comprising all the extracted number-of-days-since-last-invoice-issued values; and a data set comprising all of the extracted temporal-related value combinations.
The probability data and threshold data calculation unit 109 is configured to then apply a probability distribution algorithm to this data to generate, for each invoice feature, a probability distribution showing the distribution of probabilities that the predefined invoice feature will take a range of different values in invoices exchanged between the vendor-buyer pair.
With reference to the invoice features described above, this would comprise three probability distributions per vendor-buyer pair, namely: a probability distribution of invoice-amount values; a probability distribution of days-since-last-invoice-issued values; and a probability distribution for each identified combination of temporal-related values.
The probability data and threshold data calculation unit 109 then applies a threshold setting algorithm to each probability distribution to generate a probability threshold for each probability distribution.
As is explained in more detail below, the threshold setting algorithm is configured to determine a probability threshold for each probability distribution. If an invoice contains an invoice feature value which the probability distribution indicates has a probability of occurring that exceeds this threshold, this is indicative of the transaction to which the invoice relates being a recurrently occurring and trusted transaction. As will be described in more detail below, the presence of several such invoice feature values provides the basis on which a received invoice is classified as relating to a recurrently occurring and trusted transaction.
An example of the “online” operation of the invoice classification module 103 is now explained further with reference to the process flow depicted in Figure 2.
At a first step S201, an invoice data item generated by the source computer system 105 is received via the network 106 at the invoice processing unit 111.
At a second step S202, the invoice processing unit 111 analyses the invoice data item to determine the buyer-vendor pair (source-recipient pair) with which it is associated, specifically to determine the vendor it is from and the buyer for whom it is intended. This can be done in any suitable way. For example, one approach is to use Optical Character Recognition (OCR) to extract text from the invoice data item and employ pattern matching or natural language processing techniques to identify the vendor and buyer information.
Another approach involves parsing structured data, such as XML or EDI, to extract the relevant fields directly and determine the vendor and buyer. If the invoice was sent as an email or as an attachment in an email, analysing email metadata, including the sender, recipient, or subject line, can be used to deduce the vendor and buyer.
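For the structured-data approach, a minimal sketch using Python's standard XML parser is given below. The element names (`Invoice`, `Vendor`, `Buyer`, `Name`), the sample document and the function name `vendor_buyer_from_xml` are hypothetical; a real invoice would follow whatever schema the vendor actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice layout, for illustration only.
SAMPLE_XML = """<Invoice>
  <Vendor><Name>Acme Supplies Ltd</Name></Vendor>
  <Buyer><Name>Example Retail plc</Name></Buyer>
  <Amount currency="USD">100.00</Amount>
</Invoice>"""


def vendor_buyer_from_xml(xml_text):
    """Return the (vendor, buyer) names read directly from the XML fields."""
    root = ET.fromstring(xml_text)
    vendor = root.findtext("Vendor/Name")
    buyer = root.findtext("Buyer/Name")
    if vendor is None or buyer is None:
        raise ValueError("vendor or buyer element is missing")
    return vendor.strip(), buyer.strip()


pair = vendor_buyer_from_xml(SAMPLE_XML)
```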
Additionally, searching for specific keywords or phrases, like 'Invoice From', 'Bill To', or company names, can help identify the vendor and buyer. Relying on pre-defined templates is another option; if the invoice follows a known template, the vendor and buyer information can be extracted based on the expected layout and data fields.
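The keyword-based approach can be sketched with regular expressions applied to text already extracted from the invoice. The label formats (a keyword followed by a colon and the name on the same line), the sample text and the function name `vendor_buyer_from_text` are assumptions made for illustration.

```python
import re

# Assumed label formats: "Invoice From: <name>" and "Bill To: <name>".
_PATTERNS = {
    "vendor": re.compile(r"Invoice From\s*:\s*(.+)", re.IGNORECASE),
    "buyer": re.compile(r"Bill To\s*:\s*(.+)", re.IGNORECASE),
}


def vendor_buyer_from_text(text):
    """Return (vendor, buyer); an entry is None if its keyword is not found."""
    found = {}
    for role, pattern in _PATTERNS.items():
        match = pattern.search(text)
        found[role] = match.group(1).strip() if match else None
    return found["vendor"], found["buyer"]


sample_text = (
    "INVOICE\n"
    "Invoice From: Acme Supplies Ltd\n"
    "Bill To: Example Retail plc\n"
    "Total: $100.00\n"
)
```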
These methods can be used individually or in combination to improve the accuracy of determining the vendor-buyer pair associated with the invoice.
For example, the buyer (the intended recipient) may be an organisation associated with the user device 117 and the vendor (the source) may be an organisation associated with the source computer system 105.
Once this has been determined, at a third step S203, the invoice processing unit 111 loads the probability data (a probability data set) and the threshold data (a threshold data set) associated with the identified buyer-vendor pair from the probability data and threshold data storage unit 110.
As described above, in one example, the probability data may include a probability distribution associated with values the invoice-amount invoice feature could take; a probability distribution associated with values the number-of-days-since-last-invoice invoice feature could take, and a probability distribution associated with values the temporal values combination invoice feature could take.
At the fourth step S204, the invoice processing unit 111 analyses the received invoice data item to identify the predetermined invoice features and determine the value of these invoice features in the received invoice data item. For example, the invoice processing unit 111 may analyse the received invoice and determine a value for the invoice amount (for example $100); a value for the number of days since last invoice (for example 28) and a value for the temporal feature combination (for example issue date = 2; day = Wednesday; month = April; year = 2023).
The values for features such as the invoice amount and temporal features can be determined using text recognition algorithms if the invoice data item contains text data or Optical Character Recognition (OCR) technology if the invoice data item is an image file. In the latter case, OCR is used to convert the image data into text, which is then analysed to extract the required feature values.
However, in certain examples, the values cannot be derived directly from the invoice itself and must be determined with reference to additional data. For example, to calculate the number of days since the last invoice, the invoice processing unit 111 accesses an external data source containing the invoice history of the vendor, such as a database or spreadsheet. The system retrieves the date of receipt for the previous invoice from the vendor and compares it with the current invoice's date of receipt. The difference in days between the two invoices is then determined, providing the value for the "days since the last invoice" feature.
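A minimal sketch of this calculation is given below. It assumes the invoice history is available as a list of receipt dates; the function name and the example dates are hypothetical.

```python
from datetime import date


def days_since_last_invoice(current_receipt, previous_receipts):
    """Return the days between the current receipt date and the latest earlier one.

    Returns None when the history holds no earlier invoice.
    """
    earlier = [d for d in previous_receipts if d < current_receipt]
    if not earlier:
        return None
    return (current_receipt - max(earlier)).days


# Hypothetical receipt history for one vendor.
history = [date(2023, 2, 8), date(2023, 3, 8)]
print(days_since_last_invoice(date(2023, 4, 5), history))  # 28
```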
At a fifth step S205, the feature values determined at the fourth step S204 and the probability data and threshold data loaded at the third step S203 are passed to the classification decision module 112.
At a sixth step S206, the classification decision module 112 performs a recurrently occurring and trusted transaction classification algorithm. Specifically, for each feature value identified at the fourth step S204, it is determined if the probability of that feature value occurring exceeds the relevant predetermined probability threshold.
For example, with reference to the feature values mentioned above, it is determined: with respect to the probability distribution relating to the invoice-amount invoice feature, whether the probability of this feature taking the value “$100” exceeds the predetermined probability threshold relating to this feature; with respect to the probability distribution relating to the number-of-days-since-last-invoice invoice feature, whether the probability of this feature taking the value “28” exceeds the predetermined probability threshold relating to this feature; and with respect to the probability distribution relating to the temporal feature combination, whether the probability of this feature taking the value “2; Wednesday; April; 2023” exceeds the predetermined probability threshold relating to this feature.
If the probability calculated for each value identified at the fourth step exceeds each of the corresponding probability thresholds, then the RTT classification algorithm determines that the invoice relates to a recurrently occurring and trusted transaction.
At a seventh step S207, the classification decision module 112 communicates the corresponding RTT classification data to the user device 117 via the network 106. This is data indicating whether or not the classification decision module 112 has classified the received invoice as relating to a recurrently occurring and trusted transaction.
On receipt of the RTT classification data, the user device 117 is configured to present, via an interface, the classification decision to the user and prompt the user to provide classification feedback data either confirming (validating) or rejecting the RTT classification. This classification feedback data is then returned (transmitted) to the classification decision module 112 via the network 106.
If the classification decision module 112 has generated classification data indicating the received invoice data item relates to a recurrently occurring and trusted transaction and the user provides classification feedback data validating that the RTT classification data is correct, at an eighth step S208, the classification decision module 112 communicates this validated RTT classification data to the automatic approval module 114 of the payment processing module 104.
On receipt of this RTT classification data, which indicates that the received data item relates to a recurrently occurring and trusted transaction, the automatic approval module 114 initiates an automated approval of the payment for the transaction to which the received invoice data item relates. Typically, this includes extracting the relevant payment data from the invoice, forwarding this to the general ledger and accounts payable database 116, and initiating the relevant processes for the vendor to get paid.
Similarly, although not shown in the example shown in Figure 2, as will be understood, if the classification decision module 112 generates classification data indicating that the received invoice data item does not relate to a recurrently occurring and trusted transaction, and the user provides classification feedback data validating that this RTT classification data is correct, then the classification decision module 112 communicates this validated RTT classification data to the manual approval module 115.
On receipt of this validated RTT classification data, which indicates that the received data item does not relate to a recurrently occurring and trusted transaction, the manual approval module 115 initiates a process whereby payment of the transaction to which the invoice relates must be manually approved, for example via a user of the user device 117. If this manual approval process results in the invoice being approved, the relevant payment data is communicated to the general ledger and accounts payable database 116.
However, if the classification feedback data from the user comprises a rejection of the RTT classification data, the classification decision module 112 is configured to update the classification data based on this feedback data.
Thus, returning to the example shown in Figure 2, if the classification decision module 112 generates classification data indicating that the received invoice data item relates to a recurrently occurring and trusted transaction, but the user rejects the RTT classification by providing classification feedback data rejecting this classification (i.e. indicating that the RTT classification data is incorrect and the received invoice data item does not relate to a recurrently occurring and trusted transaction), the classification decision module 112 updates the classification data to indicate that the received invoice data item does not relate to a recurrently occurring and trusted transaction, and at a ninth step S209, the classification decision module 112 communicates the updated RTT classification data (indicating a non-RTT classification) to the manual approval module 115.
On receipt of this classification data, the manual approval module 115 then initiates a process whereby payment of the transaction to which the invoice relates must be manually approved, for example via a user of the user device 117. If this manual approval process results in the invoice being approved, the relevant payment data is communicated to the general ledger and accounts payable database 116.
Similarly, although not shown in the example shown in Figure 2, in certain examples, if the classification decision module 112 generates classification data indicating that the received invoice data item does not relate to a recurrently occurring and trusted transaction, but the user rejects the RTT classification by providing classification feedback data rejecting this classification (i.e. indicating that the RTT classification data is incorrect and the received invoice data item does in fact relate to a recurrently occurring and trusted transaction), the classification decision module 112 updates the classification data to indicate that the received invoice data item does relate to a recurrently occurring and trusted transaction.
The classification decision module 112 then communicates the updated RTT classification data (indicating an RTT classification) to the automatic approval module 114 of the payment processing module 104. On receipt of the updated RTT classification data, which indicates that the received data item relates to a recurrently occurring and trusted transaction, the automatic approval module 114 initiates an automated approval of the payment for the transaction to which the received invoice data item relates.
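The four feedback cases described above (a confirmed or rejected RTT classification, and a confirmed or rejected non-RTT classification) reduce to a simple rule: a confirmed classification is kept and a rejected one is inverted. A minimal sketch follows, in which the function names and the string labels standing in for the automatic approval module 114 and the manual approval module 115 are hypothetical.

```python
def final_rtt_label(predicted_rtt, user_confirms):
    """A confirmed classification is kept; a rejected one is inverted."""
    return predicted_rtt if user_confirms else not predicted_rtt


def route_invoice(predicted_rtt, user_confirms):
    """Return a label for the module that handles the invoice after feedback."""
    if final_rtt_label(predicted_rtt, user_confirms):
        return "automatic_approval"  # stands in for automatic approval module 114
    return "manual_approval"  # stands in for manual approval module 115


for predicted in (True, False):
    for confirms in (True, False):
        print(predicted, confirms, route_invoice(predicted, confirms))
```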
At a tenth step S210, the reinforcement learning function 113 generates threshold update data indicative of the feedback data provided by the user. This threshold update data is communicated to the probability data and threshold data calculation unit 109.
At an eleventh step S211, the probability data and threshold data calculation unit 109 identifies the threshold data associated with the buyer-vendor pair and updates it in accordance with the threshold update data, then stores the updated threshold data in the probability data and threshold data storage unit 110 for use in classifying future received data items.
The threshold update data generated by the reinforcement learning function 113 depends on the feedback data provided by the user. For example, if the feedback data indicates that the RTT classification algorithm has correctly classified the invoice, then the reinforcement learning function 113 may send no threshold update data and instead apply a positive weight to the thresholds associated with the relevant vendor-buyer pair to make it less likely that those thresholds will be adjusted.
On the other hand, if the feedback data indicates that the RTT classification algorithm has generated a false positive, unless the thresholds in question are associated with a high degree of positive weighting (indicating that the mis-classification may be due to another factor), the reinforcement learning function 113 may send threshold update data to the probability data and threshold data calculation unit 109 to increase the threshold probability by a predetermined amount, to reduce the likelihood of future false positives.
Alternatively, if the feedback data indicates that the RTT classification algorithm has generated a false negative, unless the thresholds in question are associated with a high degree of positive weighting, the reinforcement learning function 113 may send threshold update data to the probability data and threshold data calculation unit 109 to decrease the threshold probabilities by a predetermined amount, to reduce the likelihood of future false negatives.
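The feedback rules above can be sketched as follows. The ThresholdState class, the step size of 0.01 and the weight cut-off of 5 are illustrative assumptions; the description specifies only that a "predetermined amount" and a "high degree of positive weighting" are used.

```python
from dataclasses import dataclass


@dataclass
class ThresholdState:
    threshold: float  # probability threshold for one invoice feature
    weight: int = 0   # positive weight accumulated from confirmed classifications


def apply_feedback(state, predicted_rtt, user_confirms, step=0.01, high_weight=5):
    """Return the updated state after one item of user feedback."""
    if user_confirms:
        # Correct classification: no threshold change, add positive weight.
        return ThresholdState(state.threshold, state.weight + 1)
    if state.weight >= high_weight:
        # Heavily reinforced threshold: the error may be due to another factor.
        return state
    if predicted_rtt:
        # False positive: raise the threshold.
        return ThresholdState(state.threshold + step, state.weight)
    # False negative: lower the threshold, never below zero.
    return ThresholdState(max(0.0, state.threshold - step), state.weight)


state = ThresholdState(threshold=0.05)
state = apply_feedback(state, predicted_rtt=True, user_confirms=False)
print(state)
```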
Figure 3 provides a diagram depicting an example implementation of the RTT classification algorithm described with reference to the sixth step S206.
At a first step S301 one of the identified feature values identified from the received invoice by the invoice processing unit 111 is loaded along with the corresponding probability distribution and the probability threshold associated with the invoice feature to which that identified feature value relates.
At a second step S302, it is determined, using the loaded probability distribution and probability threshold, whether a probability of the identified feature value of the invoice feature occurring exceeds the probability threshold.
If the probability does not exceed the probability threshold, at a third step S303 the invoice is classified as an invoice that does not relate to a recurrently occurring and trusted transaction. This is because, in this example, for an invoice data item to be classified as being associated with a recurrently occurring and trusted transaction, it is necessary for the probability of each feature value occurring to exceed the relevant predetermined threshold probability.
Accordingly, it is only necessary for the probability of one of the identified feature values occurring to not exceed the relevant probability threshold for the invoice to be classified as not being associated with a recurrently occurring and trusted transaction.
It will be understood that in other examples, an invoice may be classified as being associated with a recurrently occurring and trusted transaction if the probability of only a predetermined proportion (rather than all) of the identified feature values occurring exceeds the respective probability thresholds. For example, if the feature values for three predetermined invoice features are identified from an invoice, it may be sufficient for the probabilities of two of those feature values to exceed the respective probability thresholds for the invoice to be classified as being associated with a recurrently occurring and trusted transaction.
Returning to Figure 3, if the probability does exceed the probability threshold, if there are further feature values to be analysed, then the algorithm returns to the first step S301 and loads the next identified feature value, corresponding probability distribution and corresponding probability threshold.
However, if there are no further feature values to be analysed, implying that each calculated probability has exceeded the corresponding probability threshold, at a fourth step S304 the invoice is classified as being associated with a recurrently occurring and trusted transaction.
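The loop of Figure 3 can be sketched as below. The dictionaries of feature values, density functions and thresholds, and the toy step-shaped densities in the usage example, are assumptions made for illustration rather than data structures defined in this description.

```python
def classify_rtt(feature_values, densities, thresholds):
    """Return True only if every feature value's probability exceeds its threshold."""
    for name, value in feature_values.items():   # S301: load the next feature value
        probability = densities[name](value)     # S302: evaluate its probability
        if probability <= thresholds[name]:
            return False                         # S303: not an RTT, stop early
    return True                                  # S304: all thresholds exceeded


# Toy densities standing in for the stored probability distributions.
densities = {
    "invoice_amount": lambda v: 0.08 if 95 <= v <= 105 else 0.001,
    "days_since_last_invoice": lambda v: 0.2 if 27 <= v <= 29 else 0.01,
}
thresholds = {"invoice_amount": 0.05, "days_since_last_invoice": 0.1}

print(classify_rtt({"invoice_amount": 100, "days_since_last_invoice": 28},
                   densities, thresholds))
print(classify_rtt({"invoice_amount": 250, "days_since_last_invoice": 28},
                   densities, thresholds))
```

The early return mirrors the observation that a single feature value failing its threshold is enough to classify the invoice as not relating to a recurrently occurring and trusted transaction.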
Returning to Figure 1, as described above, the probability data and threshold data calculation unit 109 is configured to apply a probability distribution algorithm to the data sets of invoice feature values it receives from the invoice feature analysis function 108. The probability distribution algorithm generates for each invoice feature a probability distribution indicative of a likelihood that the predefined invoice feature will take a given value.
Typically, the plurality of probability distributions comprises a plurality of probability density functions. Specifically, in typical embodiments, the probability distribution algorithm applies a KDE function to the data sets of invoice feature values to produce for each invoice feature a continuous probability distribution, in other words a probability density function. This is advantageous because sharp discontinuities in the input data are smoothed, allowing for more accurate representation and analysis of the underlying data structure. Further, as described in more detail below, thresholding can be done in terms of regions of highest probability density, which is computationally efficient. Moreover, the variables of the kernel used in the KDE function can be optimised for each different invoice feature. More specifically, for any given invoice feature, there will be an optimal kernel shape and optimal bandwidth which balances the smoothness of the distribution produced with sensitivity to the variance in the underlying data set. Use of a KDE function means that this bandwidth can be selected for each different invoice feature. For example, for the invoice feature “invoice amount”, an optimal kernel shape has been found to be a Gaussian kernel with a bandwidth of “5” where:
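\[ \hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \]
Here \(\hat{f}_h(x)\) is the computed KDE at each point \(x\), with a contribution from each data point \(x_i\), \(n\) is the number of data points, and \(K\) is the kernel used. The bandwidth of the kernel is \(h\). For a Gaussian kernel, the kernel has the form
\[ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^{2}/2} \]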
An example of such an algorithm is described further with reference to Figure 4.
Figure 4 provides a flow diagram depicting a probability distribution algorithm in accordance with certain examples of the invention.
At a first step S401, a data set of feature values is received for a predefined data item feature, the feature values having been extracted by the invoice feature analysis function 108 as described above. For example, this data set of feature values may comprise all the values of the “invoice-amount” feature identified from a corpus of invoices relating to a particular vendor-buyer pair.
At a second step S402, a KDE function is applied to this data set. The KDE function, or Kernel Density Estimation, works by placing a kernel (a smooth, continuous function such as a Gaussian) at each data point in the dataset. The kernels are then summed, and the resulting function is normalised to produce a continuous probability distribution that represents the underlying structure of the data. This continuous distribution estimates the probability density of the data.
At a third step S403, the continuous probability distribution that is output by the KDE function is stored in the probability data and threshold data storage unit 110.
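A minimal sketch of a Gaussian KDE of this kind, written using only the Python standard library, is given below. The function names and the sample invoice amounts are hypothetical; the bandwidth of 5 follows the invoice-amount example given above.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def make_kde(data, bandwidth):
    """Return f(x) = (1 / (n * h)) * sum_i K((x - x_i) / h) for the data set."""
    if not data or bandwidth <= 0:
        raise ValueError("data must be non-empty and bandwidth positive")
    n = len(data)

    def density(x):
        total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
        return total / (n * bandwidth)

    return density


# Hypothetical invoice amounts for one vendor-buyer pair.
amounts = [98.0, 100.0, 100.0, 101.0, 103.0]
density = make_kde(amounts, bandwidth=5.0)
print(density(100.0) > density(150.0))  # True
```

Dividing the summed kernels by n multiplied by h is the normalisation step: it makes the resulting function integrate to one, so that it can be treated as a probability density function.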
As described above, the probability distribution algorithm depicted in Figure 4 is repeated for each data set of extracted invoice feature values producing a probability distribution associated with each predefined invoice feature per vendor-buyer pair, for example: a probability distribution of invoice-amount values; a probability distribution of days-since-last-invoice issued values and a probability distribution for each identified combination of temporal features.
As described above, once the probability distribution algorithm has generated the probability distributions for each invoice feature, the probability data and threshold data calculation unit 109 then applies a threshold setting algorithm to each probability distribution to generate, for each probability distribution, a probability threshold.
As described in more detail below, in certain embodiments, this threshold is set as a function of the entropy of the data set of invoice feature values. This ensures that the threshold is sensitive to the underlying variability of the data set of invoice feature values, in particular accounting for the degree of uniformity in the data distribution. This is important because the more uniform the data set (i.e., the smaller the degree to which the values for the invoice feature vary from invoice to invoice), the less information is contained within the variation of the feature values, and the less likely that any of them are RTTs.
By setting the threshold as a function of the entropy, the algorithm is able to adjust the threshold according to the degree of uniformity in the data set. In cases where the data set is more uniform, the threshold will be set closer to the maximum probability density, ensuring that only the most significant variations in the feature values are considered for further analysis. Conversely, in cases where the data set exhibits a higher degree of variability, the threshold will be set lower, allowing for a broader range of values to be included in the analysis. This adaptive approach helps to reduce the likelihood of false positives and false negatives during invoice classification.
An example of such a threshold setting algorithm is described further with reference to Figure 5.
Figure 5 provides a flow diagram depicting a threshold setting algorithm based on the entropy ratio of the data distribution in accordance with certain examples of the invention.
At a first step S501, the data set and its corresponding probability distribution, generated using the KDE function, are received.
At a second step S502, the entropy (Hd) of the data distribution of the data set is calculated.
At a third step S503, the entropy (Hu) of a uniform distribution with the same number of points as the underlying data distribution of the data set is calculated.
At a fourth step S504, the entropy ratio (Hr) is calculated using the formula Hr = Hd/Hu.
At a fifth step S505, the highest probability density (pmax) and the range of probability density values (Prange) are determined from the received probability distribution, the range being measured as the difference between the maximum and minimum probability density values.
At a sixth step S506, the entropy ratio component (1 - Hr(p)) is computed for the probability distribution of the data set, which indicates how non-uniform the distribution is.
At a seventh step S507, the threshold (tf) is calculated using the formula tf = pmax - (1 - Hr(p)) × Prange. This threshold is set based on the entropy ratio, such that it is closer to the maximum value for more uniform distributions and further away for less uniform distributions.
At an eighth step S508, the computed threshold (tf) is stored.
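The steps of Figure 5 can be sketched as below, under several stated assumptions: the entropy Hd is taken to be the Shannon entropy of the empirical frequencies of the distinct values in the data set, the entropy Hu is taken to be log(n) for a data set of n points, and pmax and Prange are taken from density values sampled on a grid. The description does not fix these details, so the sketch is one possible reading rather than a definitive implementation.

```python
import math
from collections import Counter


def entropy_ratio(data):
    """Hr = Hd / Hu, with Hd from value frequencies and Hu = log(n) (assumed)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    counts = Counter(data)
    h_d = -sum((c / n) * math.log(c / n) for c in counts.values())  # S502
    h_u = math.log(n)                                               # S503
    return h_d / h_u                                                # S504


def entropy_threshold(data, density_values):
    """tf = pmax - (1 - Hr) * Prange, from densities sampled on a grid."""
    p_max = max(density_values)                                     # S505
    p_range = p_max - min(density_values)
    return p_max - (1.0 - entropy_ratio(data)) * p_range            # S506, S507


# Hypothetical densities sampled from a stored distribution.
sampled = [0.0, 0.1, 0.4]
print(entropy_threshold([100, 100, 100, 105], sampled))
```

Under these assumptions, a data set whose values are all distinct gives Hr = 1 and a threshold equal to pmax, while a data set with a single repeated value gives Hr = 0 and a threshold equal to the minimum density.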
The computed threshold tf is indicative of a region of highest probability density of the probability distribution which is described further with reference to Figure 6A.
Figure 6A provides a diagram depicting an example probability distribution, which is generated using the KDE function as described in the Figure 4 algorithm, and its corresponding threshold (tf). As will be understood, this probability distribution corresponds to a probability density function associated with the likelihood of an invoice data item feature taking a given value.
In Figure 6A, the x-axis represents the data values, and the y-axis represents the probability density values. The probability distribution is illustrated as a continuous Gaussian-like curve, resulting from the KDE function applied in the probability distribution algorithm described with reference to Figure 4.
The threshold (tf) is calculated using the entropy-based algorithm outlined in Figure 5. The threshold value is visually represented as a horizontal line on the probability distribution.
As can be understood with reference to Figure 6A, the threshold tf defines a region R of greatest probability density which is denoted by the shaded area under the curve. Data item feature values on the x-axis within this region represent the values with the highest probability densities and thus those that are most likely to occur.
Referring back to Figure 3, at the second step S302, it is determined, using the loaded probability distribution and probability threshold value, whether a probability of the identified value of the data item feature occurring exceeds the probability threshold value. An example of how this can be performed can be understood with reference to Figure 6A.
As can be seen from Figure 6A, if a data item feature value lies at a point along the x-axis that falls within the region R, it is within the region of highest probability density defined by the threshold tf. Therefore, to determine if a probability of the identified value of the data item feature occurring exceeds the probability threshold value, it is simply a matter of determining whether or not the data item feature value is within this region.
Using the region of highest probability density as set by the threshold offers a computationally efficient approach for determining if the probability threshold is exceeded. This method means that only a single operation need be performed, i.e., comparing the identified value's probability against the threshold, rather than alternative techniques that might involve a sequence of more complex operations, for example, calculating individual probabilities for all possible values, conducting multiple hypothesis tests, or iteratively updating Bayesian models.
The example probability distribution shown in Figure 6A depicts a simple probability density function with a single peak. However, it will be understood that in certain examples, the probability density function may be multi-modal with two or more peaks, some of which may have regions of highest probability density. In such examples, the probability density function defining the probability distribution may have multiple regions of highest probability density within which the probability of a value occurring may fall. Such a probability distribution might occur, for example, for the invoice data item feature of invoice-amount where recurrently occurring and trusted transactions typically take one of two or more distinct values, for example either $50 or $100, or some slightly perturbed values, for example $52 or $101.
An example of a probability distribution comprising such a multi-modal probability density function is shown in Figure 6B. As can be seen from Figure 6B, if an invoice-amount value has a value that falls within a first region of highest density R1 centred around the value $50 or a value that falls within a second region of highest density R2 centred around the value $100, it is within a region of highest probability density defined by the threshold tf.
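One way to locate such regions is to sample the probability density function on a grid and collect the contiguous runs of grid points whose density exceeds the threshold tf. The sketch below assumes this grid-sampling approach; the function names, the grid and the density values are illustrative only.

```python
def high_density_regions(grid, density_values, threshold):
    """Return [(start, end), ...] runs of grid points whose density exceeds threshold."""
    regions = []
    start = None
    previous = None
    for x, p in zip(grid, density_values):
        if p > threshold:
            if start is None:
                start = x          # a new region opens here
            previous = x
        elif start is not None:
            regions.append((start, previous))  # the open region has ended
            start = None
    if start is not None:
        regions.append((start, previous))      # region running to the grid's end
    return regions


def in_any_region(value, regions):
    """True if the value lies inside at least one region."""
    return any(lo <= value <= hi for lo, hi in regions)


# Toy bimodal density with peaks near $50 and $100.
grid = [40, 45, 50, 55, 60, 90, 95, 100, 105, 110]
values = [0.001, 0.02, 0.04, 0.02, 0.001, 0.001, 0.02, 0.04, 0.02, 0.001]
regions = high_density_regions(grid, values, threshold=0.01)
print(regions)  # [(45, 55), (95, 105)]
print(in_any_region(52, regions), in_any_region(75, regions))  # True False
```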
In the example described above, the probability distributions are generated in the form of continuous probability distributions, specifically probability density functions. However, in alternative embodiments, some or all of the probability distributions can be calculated using different techniques. For instance, the probability distributions can be represented as histograms or normalised histograms with a plurality of bins, each representing a discrete value that the invoice data item feature can take. The height of each bin corresponds to the frequency of occurrence of that particular invoice data item feature within the dataset of invoice feature values identified from the corpus of invoice data items.
In such examples, instead of the probability threshold representing a region of maximum probability density, the probability threshold may be set by considering the frequency of occurrence for each discrete value in the distribution. This can be done by selecting a suitable frequency threshold, such that only those discrete values with a frequency higher than the threshold are considered likely or probable. In keeping with the example described above, this probability threshold can be set based on the entropy of the underlying dataset.
In certain embodiments of the present invention, the probability thresholds for determining likely or probable invoice feature values are set using a predetermined static factor, rather than employing entropy-based techniques. The static factor is applied differently based on the type of probability distribution associated with the invoice feature values.
For continuous probability distributions, such as probability density functions, the probability threshold is determined by multiplying the predetermined static factor by the maximum probability density. Conversely, for probability distributions consisting of discrete values, such as histograms or normalised histograms, the probability threshold is established by multiplying the predetermined static factor by the maximum probability value.
In some instances, a predetermined static factor of 0.9 is selected for setting the probability thresholds. This approach allows for a consistent and straightforward method of determining probability thresholds, which can be easily adjusted by altering the static factor to accommodate various degrees of confidence or specificity in the analysis of invoice features.
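For the discrete case, the static-factor approach might be sketched as follows, using a normalised histogram built from observed values. The function names and the sample days-since-last-invoice values are hypothetical; the factor of 0.9 is the example value given above.

```python
from collections import Counter


def normalised_histogram(values):
    """Map each distinct value to its relative frequency in a non-empty data set."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}


def static_threshold(probabilities, factor=0.9):
    """Threshold = static factor multiplied by the maximum probability."""
    return factor * max(probabilities)


def is_probable(value, histogram, threshold):
    """Unseen values have probability zero and are never probable."""
    return histogram.get(value, 0.0) > threshold


# Hypothetical days-since-last-invoice values for one vendor-buyer pair.
days = [28, 28, 28, 28, 30, 28, 27, 28, 28, 28]
histogram = normalised_histogram(days)
threshold = static_threshold(histogram.values())
print(is_probable(28, histogram, threshold), is_probable(30, histogram, threshold))
```

For a continuous probability density function the same factor would be applied to the maximum probability density instead of the maximum probability value.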
In certain embodiments, the classification decision module 112 is configured to perform a confidence score algorithm if the RTT classification algorithm determines that the invoice relates to a recurrently occurring and trusted transaction. This algorithm produces a confidence score output which is communicated to the user device 117 along with the corresponding RTT classification data (indicating whether the invoice has been classed as being associated with a recurrently occurring and trusted transaction). The confidence score is indicative of a degree of certainty associated with the classification. Software running on the user device 117 is configured to display the confidence score to the user when they are confirming or rejecting the RTT classification performed by the RTT classification algorithm. Providing a confidence score in this way improves decision-making by offering users additional information about the classification and helps prioritise manual review by indicating which invoices may need closer scrutiny based on the confidence score. Additionally, the provision of a confidence score may enhance user trust in the system by making the classification process more transparent.
Figure 7 provides a flow diagram depicting a confidence score calculation algorithm. At a first step S701, the probability values for each identified invoice feature value of the received invoice data item are calculated using the relevant probability distribution. This information can be obtained during the second step S302 of the invoice classification algorithm.
At a second step S702, the probability values obtained in step S701 are multiplied together to compute a product of probabilities. At a third step S703, the number of invoices between the vendor and buyer is identified, and a scaling factor is calculated based on this number. In other words, this number is indicative of the size of the corpus of previously received data items which were used by the probability data and threshold data calculation unit 109 to generate the probability distributions for each predefined invoice feature. Thus, the scaling factor is a function that depends on the number of invoices between the vendor and buyer. The goal is to increase the impact of the variable as the number of invoices increases. If there are fewer invoices, the effect is down-weighted, and vice versa. The value of the scaling function approaches 1 asymptotically as the number of invoices increases, and the rate at which it approaches 1 is controlled by a rate parameter 'a'. In this way, the use of larger corpora of previously received invoice data items to generate the probability distributions leads to a higher scaling factor, reflecting increased confidence in the classification.
At a fourth step S704, the scaling factor determined in step S703 is applied to the product of probabilities from step S702, resulting in the confidence score. At a fifth step S705, the calculated confidence score is communicated to the user device, along with the corresponding RTT classification data.
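The confidence score calculation can be sketched as below. The exact form of the scaling function is not given in this description; the sketch assumes 1 - exp(-a·x), which is one function that rises with the number of invoices x and approaches 1 asymptotically at a rate controlled by 'a'. The function names and the default value a = 0.1 are likewise assumptions.

```python
import math


def scaling_factor(num_invoices, a=0.1):
    """Assumed form 1 - exp(-a * x): rises from 0 and approaches 1 asymptotically."""
    return 1.0 - math.exp(-a * num_invoices)


def confidence_score(probabilities, num_invoices, a=0.1):
    """Scale the product of the feature probabilities by the corpus-size factor."""
    product = math.prod(probabilities)                   # S702
    return scaling_factor(num_invoices, a) * product     # S703 and S704


# The same feature probabilities yield a higher score with a larger corpus.
print(confidence_score([0.5, 0.5], num_invoices=10))
print(confidence_score([0.5, 0.5], num_invoices=100))
```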
Figure 8 provides a diagram depicting an example of the scaling function that depends on the number of invoices (denoted as 'x') between the vendor and buyer. The graph demonstrates how the value of the scaling function approaches 1 asymptotically as the number of invoices increases. On the x-axis, the number of invoices between the vendor and buyer is plotted, while the y-axis represents the value of the scaling function. The curve in the graph illustrates how the effect of the variable is down-weighted for fewer invoices and increases as the number of invoices grows.
Figure 9 provides a simplified schematic diagram depicting an example implementation of a system of the type described with reference to Figure 1.
Figure 9 illustrates an accounting services system 901 that runs software and data management systems, offering a variety of accounting services 902. These services may include, for example, payroll management, tax preparation and filing, financial reporting, bookkeeping, and cash flow analysis, among others.
The accounting services system 901 also operates software and data management systems that provide an invoice processing service 903. This service comprises a historical invoice analysis module 904, an invoice classification module 905, and a payment processing module 906, which operate in accordance with the historical invoice analysis module 102, invoice classification module 103, and payment processing module 104 described above.
The accounting services system 901 is configured to provide the accounting services 902 and the invoice processing services 903 to a plurality of users 907 connected to the accounting services system 901 via a data network 908, typically provided by the Internet.
The accounting services 902 and the invoice processing service 903 can be delivered to the users 907 in a variety of suitable ways. In typical embodiments, each user 907 operates their own web browsing software, and the services provided by the accounting services system 901 are made available through a web-based interface or platform. Users can access the services by navigating to a dedicated website or web application, where they can securely log in with their credentials and interact with the system to perform various accounting and invoice processing tasks. Alternatively, the accounting services system 901 could offer these services through custom desktop or mobile applications, which users can download and install on their devices.
The historical invoice analysis module 904, invoice classification module 905, and payment processing module 906 can be implemented using a variety of suitable computing techniques and methodologies known in the art. The implementation of these modules may involve hardware, software, or a combination of both, and can utilise various programming languages, algorithms, data structures, and processing approaches, as appropriate for the specific requirements of each module.
These modules can be deployed on a single computing device or distributed across multiple devices using cloud computing or other distributed computing techniques. This allows for flexibility in scaling the system according to demand and performance requirements. Furthermore, the modules can be designed to operate independently as separate modules, or they can be integrated and run together as a unified system to streamline the overall process.
The design and implementation of these modules are not limited to any particular technology or method and can be adapted to incorporate new developments or advancements in the field of computing, such as emerging cloud-based solutions or novel distributed computing paradigms, as long as the intended functionality and objectives are achieved.
The historical invoice analysis module 904, invoice classification module 905, and payment processing module 906 can also be implemented in various ways, depending on the system architecture and user requirements. As illustrated in Figure 9, these modules can be implemented remotely on the accounting services system 901, allowing users to access and utilise the services through a web-based interface or a dedicated application, facilitating centralised management, data security, and ease of updates.
Alternatively, the modules can be implemented directly on the user devices, enabling offline functionality and greater control over data and system customisation.
The data storage provided by the historical invoice data database 107 and the probability data and threshold data storage unit 110 can be implemented in any suitable way as is known in the art. For example, these storage systems may be configured as individual databases or as multiple distributed databases, employing various database technologies such as relational databases, NoSQL databases, or object-oriented databases.
In the examples described above, the invoice features discussed include invoice amount, number of days since last invoice, and a combination of temporal values. However, any suitable invoice feature, derivable directly or indirectly from an invoice, can be used. For example, other potential invoice features may encompass invoice due date, invoice currency, tax rates applied, line-item descriptions, quantities and prices, vendor identification or categorisation, purchase order numbers, and shipping or delivery information, among others.
In the examples described above, an invoice data item is classified as being associated with a recurrently occurring and trusted transaction if it includes three independent values, where the probability of each of those values occurring exceeds the relevant predetermined threshold probability. However, the number of values may be different in different embodiments. For example, in alternative embodiments, an invoice data item is classified as being associated with a recurrently occurring and trusted transaction if it includes four, five, ten or any suitable integer of independent values, where the probability of each of those values occurring exceeds the relevant predetermined threshold probability.
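The decision rule described above can be sketched as follows. This is a minimal sketch, assuming the per-feature probabilities have already been computed from the stored probability distributions; the function name, the feature names, and the `required_proportion` parameter (which mirrors the "predetermined proportion" wording of the claims) are hypothetical.

```python
from typing import Mapping


def classify_rtt(
    probabilities: Mapping[str, float],
    thresholds: Mapping[str, float],
    required_proportion: float = 1.0,
) -> bool:
    """Return True if enough feature values exceed their probability thresholds.

    With required_proportion = 1.0, every feature must pass, matching the
    examples in which all independent values must exceed their thresholds.
    """
    if not probabilities:
        return False  # assumption: nothing to assess means not trusted
    passed = sum(
        1 for feature, p in probabilities.items() if p > thresholds[feature]
    )
    return passed / len(probabilities) >= required_proportion


if __name__ == "__main__":
    # Illustrative numbers only; not taken from the specification.
    probs = {"amount": 0.42, "days_since_last_invoice": 0.31, "temporal": 0.27}
    limits = {"amount": 0.20, "days_since_last_invoice": 0.25, "temporal": 0.25}
    print(classify_rtt(probs, limits))
```

The same function covers embodiments with four, five, ten or any other number of features, because the number of features is simply the size of the mappings passed in.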
In the example described above, the classification data generated by the invoice classification module 103, indicating whether or not an invoice is associated with a recurrently occurring and trusted transaction, is input to a payment processing module 104 to selectively initiate an automatic approval module 114 or manual approval module 115. However, in alternative embodiments, the output of the classification module 103 can be used for additional or alternative processes within, for example, an accounting services system.
For instance, the classification data can be input to a fraud detection module designed to identify potentially fraudulent activity or inconsistencies in billing practices. By analysing the classification data, this module can identify invoices that deviate from the expected patterns of recurrently occurring and trusted transactions for further investigation. Alternatively, or additionally, the classification data could be input to an invoice categorisation module, which is responsible for organising and maintaining records of transactions based on their classification. This module can use the classification data to efficiently sort and track invoices, simplifying record-keeping and retrieval processes. Alternatively, or additionally, the classification data can be used by a reconciliation module that automates the process of matching invoices with corresponding purchase orders or other related documents. By incorporating the classification data, this module can prioritise reconciliation tasks for invoices identified as recurrently occurring and trusted transactions, streamlining the overall accounting process.
In the examples described above, a transaction is classified as being associated with a recurrently occurring and trusted transaction by analysing an invoice data item. However, techniques in accordance with the invention can classify if other types of data items relating to transactions are associated with recurrently occurring and trusted transactions. For example, these may include electronic purchase orders, sales receipts, credit notes, debit notes, bank statements, financial statements, tax documents, payment confirmations, electronic funds transfer records, cheques, and so on.
In the example described above, in the event that the classification decision module 112 determines that a received invoice data item relates to a recurrently occurring and trusted transaction, and this is confirmed by a user, then the automatic approval module 114 is activated to initiate a first type of transaction processing process based on the content of the invoice data item, specifically the automatic approval of the payment for the transaction to which the received invoice data item relates.
On the other hand, in the event that the classification decision module 112 determines that a received invoice data item does not relate to a recurrently occurring and trusted transaction, and this is confirmed by a user, then the manual approval module 115 is activated to initiate a second type of transaction processing process based on the content of the invoice data item, specifically the manual approval of the payment for the transaction to which the received invoice data item relates.
In certain examples, the steps associated with obtaining confirmation from a user that the RTT classification data generated by the classification decision module 112 is correct (e.g. a user providing either a validation or rejection of the classification in the classification data) can be omitted. In such examples, a first type of transaction processing process (for example activation of the automatic approval module 114) is initiated solely on the basis of the classification decision module 112 classifying a received data item as being associated with a recurrently occurring and trusted transaction, and a second type of transaction processing process (for example activation of the manual approval module 115) is initiated solely on the basis of the classification decision module 112 classifying a received data item as not being associated with a recurrently occurring and trusted transaction.
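The selection between the two transaction processing processes, with or without the user confirmation step, can be sketched as below. The names `Route` and `select_route` are hypothetical, and treating a user rejection as inverting the classification is an assumption made for this sketch; the specification states only that the classification data is updated when a rejection is received.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTOMATIC_APPROVAL = "automatic approval module 114"
    MANUAL_APPROVAL = "manual approval module 115"


def select_route(is_rtt: bool, user_validated: Optional[bool] = None) -> Route:
    """Choose the processing route for a classified data item.

    user_validated is None when the confirmation step is omitted, True when
    the user validates the classification, and False when the user rejects it.
    """
    if user_validated is False:
        # Assumption: a rejection inverts the binary classification.
        is_rtt = not is_rtt
    return Route.AUTOMATIC_APPROVAL if is_rtt else Route.MANUAL_APPROVAL


if __name__ == "__main__":
    print(select_route(True))                        # confirmation omitted
    print(select_route(True, user_validated=False))  # user rejects RTT label
```

A more conservative design would route every rejected classification to manual approval; the choice depends on how much weight an embodiment gives to user feedback.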
As noted above, processing an invoice can comprise relevant data being added to a general ledger and accounts payable database. However, full processing of the transaction associated with the invoice typically involves executing payment of the invoice itself. In this way, adding relevant data to the general ledger and accounts payable database may comprise only partial automation of a transaction process associated with an invoice if actual payment of the invoice must be manually instigated or otherwise overseen. However, in certain examples, the first transaction processing process (initiated when the classification decision module 112 determines that a received invoice data item relates to a recurrently occurring and trusted transaction) can comprise a fully automated transaction process in which, for example, an invoice is processed and payment initiated automatically.
Specifically, in certain examples (either with or without user validation), the first type of transaction processing process, initiated when the classification data indicates that the received data item is associated with a recurrently occurring and trusted transaction, comprises an automated payment process. In such examples, a payment associated with the received invoice data item (or any other suitable data item for authorising a payment) is automatically initiated, for example by interacting with a banking system.
Figure 10 provides a simplified schematic diagram depicting an example implementation of such a scheme.
Figure 10 depicts an automatic approval module 114 which otherwise operates as described with reference to Figure 1 and is disposed in relation to the invoice classification module 103 and other components of the invoice classification system 101 as shown in Figure 1 (these components are omitted from Figure 10 for clarity). However, in this example, the automatic approval module 114 is additionally connected via a suitable data connection to an API 1001 for communicating data to and from a banking computer system 1002.
In use, in the event that the automatic approval module 114 receives an RTT classification from the classification decision module 112 indicating a received invoice data item is associated with a recurrently occurring and trusted transaction, the automatic approval module 114 is configured to generate payment authorisation data from the relevant payment data extracted from the invoice data item, including, for example, the payment amount, currency, recipient’s bank account details, and transaction ID. This data is then communicated to the API 1001 along with any necessary security-related data, such as encrypted API keys and session authentication tokens. The API 1001 then performs relevant security functions, such as validating and encrypting the incoming data, and communicates appropriately formatted payment instructions, along with any necessary operational commands, to the banking computer system 1002. The banking computer system 1002 then verifies the authenticity of the transaction, debits the payer’s account, credits the recipient’s account, and confirms the transaction completion back to the API 1001.
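The generation of payment authorisation data described above can be sketched as follows. This is a minimal sketch under stated assumptions: the class name, field names, and JSON layout are hypothetical and do not correspond to any particular banking API, and the security-related data (API keys, session tokens) and the network call to the API 1001 are deliberately left out.

```python
import json
from dataclasses import asdict, dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentAuthorisation:
    """Hypothetical container for the payment data named in the description."""
    transaction_id: str
    amount: Decimal
    currency: str
    recipient_account: str


def build_payment_authorisation(invoice: dict) -> PaymentAuthorisation:
    """Extract payment data from an invoice record (assumed to be a dict)."""
    amount = Decimal(str(invoice["amount"]))
    if amount <= 0:
        raise ValueError("payment amount must be positive")
    return PaymentAuthorisation(
        transaction_id=invoice["transaction_id"],
        amount=amount,
        currency=invoice["currency"],
        recipient_account=invoice["recipient_account"],
    )


def to_api_payload(auth: PaymentAuthorisation) -> str:
    """Serialise to JSON, sending the amount as a string to avoid float rounding."""
    body = asdict(auth)
    body["amount"] = str(auth.amount)
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    # Illustrative invoice values only.
    invoice = {
        "transaction_id": "INV-0001",
        "amount": "120.50",
        "currency": "EUR",
        "recipient_account": "EXAMPLE-ACCOUNT",
    }
    print(to_api_payload(build_payment_authorisation(invoice)))
```

Using `Decimal` rather than a binary float for the amount is a design choice of the sketch, intended to keep monetary values exact between extraction and transmission.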
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features. The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed. With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity. It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.). 
It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases "at least one" and "one or more" to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim recitation to embodiments containing only one such recitation, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an" (e.g., “a” and/or “an” should be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should be interpreted to mean at least the recited number (e.g., the bare recitation of "two recitations," without other modifiers, means at least two recitations, or two or more recitations). It will be appreciated that various embodiments of the present disclosure have been described herein for purposes of illustration, and that various modifications may be made without departing from the scope of the present disclosure. Accordingly, the various embodiments disclosed herein are not intended to be limiting, with the true scope being indicated by the following claims.
Claims
1. A computer implemented method for classifying received data items as being associated with recurrently occurring and trusted transactions, the method comprising the steps of: receiving a data item from a source; extracting a plurality of feature values from the data item, each feature value associated with one of a plurality of predefined data item features; retrieving from storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, wherein the probability data set includes a plurality of probability distributions, each probability distribution representing the likelihood that one of the plurality of predefined data item features will take a given value, each probability distribution based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient, and the threshold data set comprises a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions; and for each extracted feature value: determining, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted feature value exceeds the probability threshold, and if, for at least a predetermined proportion of the extracted feature values, the probability that the data item feature takes the extracted feature value exceeds the probability threshold, generating classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
2. A method according to claim 1, further comprising the steps of: transmitting the classification data to a user device; and receiving feedback data comprising a validation or a rejection of the classification from a user via the user device, wherein the classification data is updated based on the feedback data if a rejection is received from the user via the user device.
3. A method according to claim 2, wherein, upon receiving a validation or rejection of the classification in the feedback data from the user via the user device, the method further comprises the steps of: updating the probability thresholds based on the feedback data; and storing the updated probability thresholds in the storage for use in classifying future received data items.
4. A method according to any previous claim, wherein the plurality of probability distributions comprises a plurality of probability density functions.
5. A method according to claim 4, wherein the probability threshold for each probability distribution is determined based on the entropy of the data set of data item feature values to which the probability distribution relates.
6. A method according to claim 4 or 5, wherein each probability density function is derived by applying a kernel density estimation function to the data set of data item feature values to which the probability density function relates.
7. A method according to any of claims 4 to 6, wherein each probability threshold defines at least one region of highest probability density of the respective probability density function, and for each extracted feature value the step of determining, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted feature value exceeds the probability threshold comprises: determining if the extracted feature value is associated with a probability within a region of highest probability density of the respective probability distribution function defined by the probability threshold.
8. A method according to claim 1, further comprising the steps of, for each extracted feature value: calculating a probability value, using the corresponding probability distribution, indicative of a probability that the associated data item feature will take that value; and generating a score for the received data item when it is classified to be trusted using the calculated probability values.
9. A method according to claim 8, wherein the step of generating a score for the received data item comprises summing the calculated probability values multiplied by a scaling factor, wherein the scaling factor is based on a size of the corpus of previously received data items from the source sent to the recipient, wherein larger corpora lead to a higher scaling factor, reflecting increased confidence in the classification.
10. A method according to any previous claim, wherein the data item received from the source is an invoice data item; the plurality of predefined data item features are predefined invoice features, and each probability distribution is based on a data set of invoice data item feature values from a corpus of previously received invoice data items from the source sent to the recipient.
11. A method according to any previous claim, wherein if, for all of the extracted feature values, the probability that the data item feature takes the extracted feature value exceeds the probability threshold, the method comprises generating classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
12. A method according to any previous claim, further comprising initiating a first transaction processing process based on content of the received data item if the classification data is indicative of the received data item being associated with a recurrently occurring and trusted transaction.
13. A method according to claim 12, further comprising initiating a second transaction processing process based on the content of the received data item if the classification data is indicative of the received data item not being associated with a recurrently occurring and trusted transaction.
14. A method according to claim 12 or 13, wherein the first transaction processing process is an automated or partially automated transaction process.
15. A system for classifying received data items as being associated with recurrently occurring and trusted transactions, the system comprising storage on which is stored a plurality of probability data sets and threshold data sets, each probability data set and threshold data set associated with a source-recipient pair, each probability data set comprising a plurality of probability distributions, each probability distribution representing the likelihood that one of a plurality of predefined data item features will take a given value based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient with which the probability data set is associated, and each threshold data set comprising a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions, and a data item classification module comprising a data item processing module and a classification decision module, wherein said data item processing module is configured to: receive a data item from a source; extract a plurality of feature values from the data item, each value associated with one of a plurality of predefined data item features; retrieve from the storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, and the classification decision module is configured, for each extracted feature value to:
determine, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted value exceeds the probability threshold, and if, for at least a predetermined proportion of the extracted feature values, the probability that the data item feature takes the extracted value exceeds the probability threshold, the classification decision module is configured to generate classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
16. A data item classification module for use in a system according to claim 15, comprising a data item processing module and a classification decision module, wherein said data item processing module is configured to: receive a data item from a source; extract a plurality of feature values from the data item, each feature value associated with one of a plurality of predefined data item features; retrieve from storage a probability data set and a threshold data set associated with a source-recipient pair, said source-recipient pair comprising the source and an intended recipient of the data item, wherein the probability data set includes a plurality of probability distributions, each probability distribution representing the likelihood that one of the plurality of predefined data item features will take a given value, each probability distribution based on a data set of data item feature values from a corpus of previously received data items from the source sent to the recipient, and the threshold data set comprises a plurality of probability thresholds, each defining a probability threshold defining a probability relative to one of the probability distributions, and the classification decision module is configured to: for each extracted feature value: determine, using the corresponding probability distribution and probability threshold, if a probability that the data item feature takes the extracted value exceeds the probability threshold, and if, for at least a predetermined proportion of the extracted feature values, the probability that the data item feature takes the extracted feature value exceeds the probability threshold, generate classification data indicative of the received data item being associated with a recurrently occurring and trusted transaction.
17. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of claim 1.
18. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of claim 1.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363522560P | 2023-06-22 | 2023-06-22 | |
| US63/522,560 | 2023-06-22 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024261225A1 (en) | 2024-12-26 |
Family
ID=91664994
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2024/067408 (pending; published as WO2024261225A1 (en)) | 2023-06-22 | 2024-06-21 | Data item classification |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024261225A1 (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080281733A1 (en) * | 2005-06-06 | 2008-11-13 | First Data Corporation | System and method for authorizing electronic payment transactions |
| US8812419B1 (en) * | 2010-06-12 | 2014-08-19 | Google Inc. | Feedback system |
| US20180308025A1 (en) * | 2017-04-20 | 2018-10-25 | Capital One Services, Llc | Machine learning artificial intelligence system for predicting popular hours |
| US20200242525A1 (en) * | 2019-01-30 | 2020-07-30 | EMC IP Holding Company LLC | Automated generation of adaptive policies from organizational data for detection of risk-related events |
| US20210158355A1 (en) * | 2016-03-25 | 2021-05-27 | State Farm Mutual Automobile Insurance Company | Preempting or resolving fraud disputes relating to billing aliases |
| JP7040851B2 (en) * | 2018-03-09 | 2022-03-23 | 株式会社インテック | Anomaly detection device, anomaly detection method and anomaly detection program |
| CN115481424A (en) * | 2021-05-31 | 2022-12-16 | 阿里巴巴新加坡控股有限公司 | Cross-domain self-adaption method and data processing method of detection model |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11710332B2 (en) | Electronic document data extraction | |
| US12271873B2 (en) | Systems and methods for improving error tolerance in processing an input file | |
| US11610271B1 (en) | Transaction data processing systems and methods | |
| US20230132208A1 (en) | Systems and methods for classifying imbalanced data | |
| US20240211967A1 (en) | Adaptive transaction processing system | |
| US20190236601A1 (en) | Enhanced merchant identification using transaction data | |
| AU2019203386A1 (en) | Data validation | |
| US20130073386A1 (en) | Systems and methods for generating financial institution product offer proposals | |
| US12141612B2 (en) | Resource enhancement process as a function of resource variability based on a resource enhancement metric | |
| US20210035119A1 (en) | Method and system for real-time automated identification of fraudulent invoices | |
| US20230385844A1 (en) | Granting provisional credit based on a likelihood of approval score generated from a dispute-evaluator machine-learning model | |
| US12106281B1 (en) | Systems and methods for accounts payable-based batch processing | |
| US20240062099A1 (en) | Vectorized fuzzy string matching process | |
| US20240028965A1 (en) | Systems and methods for estimating stability of a dataset | |
| WO2024261225A1 (en) | Data item classification | |
| CN116151830A (en) | Method and system for early warning on-transit fund risk | |
| US20230409644A1 (en) | Systems and method for generating labelled datasets | |
| US12412412B1 (en) | Unstructured data identification and workflow execution using machine-learning techniques | |
| US20200242571A1 (en) | Automated Check Encoding Error Resolution | |
| US20250390959A1 (en) | Transaction data processing systems and methods | |
| EP4579565A1 (en) | Machine learning based systems and methods for data mapping for remittance documents | |
| US12265553B2 (en) | AI-augmented composable and configurable microservices for record linkage and reconciliation | |
| US20240331056A1 (en) | Ai-augmented composable and configurable microservices for determining a roll forward amount | |
| WO2024043795A1 (en) | Methods, systems and computer-readable media for training document type prediction models, and use thereof for creating accounting records | |
| US20250094957A1 (en) | Systems and methods for dynamic and flexible beneficiary analysis |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24735982; Country of ref document: EP; Kind code of ref document: A1 |