WO2020151015A1 - Scoring documents in document retrieval - Google Patents
- Publication number
- WO2020151015A1 (PCT/CN2019/073262)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- query
- term
- matching pattern
- terms
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
Definitions
- Information retrieval deals with the representation, storage, and organization of information items, and with access to those items.
- In document retrieval, a plurality of documents may be obtained for a given query through searching.
- The inverted index has been widely used in document searching, especially for large document sets such as web search, sponsored search, and e-commerce search.
- A scoring process may be applied to the obtained documents to determine a respective score for each document.
- The top-k documents for the given query may be selected from the plurality of searched documents and presented, based on their scores as determined by certain scoring methods.
- Embodiments of the present disclosure propose a method and apparatus for scoring documents in document retrieval.
- A query may be received, in which the query comprises one or more terms.
- A database may be searched to obtain a document containing at least one term of the one or more terms.
- Each term of the one or more terms may be classified in accordance with whether the term is contained in the document, each classified term being a matched term or an unmatched term.
- A matching pattern between the query and the document may be obtained, the matching pattern indicating an integration of each classified term.
- A relevance score of the document with respect to the query may be determined based at least on the matching pattern.
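The flow above (receive a query, search, classify each term, form a matching pattern, score from the pattern) can be sketched in a few lines. The index contents and the pattern-to-score table are illustrative assumptions, not the disclosed model:

```python
# Sketch of the claimed pipeline: search an inverted index, classify each
# query term as matched/unmatched per document, form a binary matching
# pattern, and look up a score for that pattern. All data are illustrative.

def retrieve_and_score(query_terms, inverted_index, pattern_scores):
    # Union of posting lists: documents containing at least one query term.
    candidates = set()
    for term in query_terms:
        candidates.update(inverted_index.get(term, ()))

    results = {}
    for doc in candidates:
        # Matching pattern: 1 where the query term occurs in the document.
        pattern = tuple(
            1 if doc in inverted_index.get(t, ()) else 0
            for t in query_terms
        )
        results[doc] = pattern_scores.get(pattern, 0.0)
    return results

index = {
    "home":  {3, 7, 9, 13},
    "depot": {1, 7, 10, 11},
    "wood":  {2, 7, 20},
    "floor": {1, 3, 30},
}
scores = retrieve_and_score(
    ["home", "depot", "wood", "floor"],
    index,
    pattern_scores={(1, 0, 0, 1): 5.0, (0, 1, 0, 1): 3.0, (1, 1, 1, 0): 6.0},
)
```

The top-k result set would then be the k documents with the highest values in `scores`.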
- FIG. 1 illustrates an exemplary architecture of a system for document retrieval through a network according to an embodiment.
- FIG. 2 illustrates an exemplary process for document retrieval according to an embodiment.
- FIG. 3 illustrates an exemplary inverted index showing an exemplary matching pattern between a document and a query according to an embodiment.
- FIG. 4 illustrates an exemplary Deep Neural Network (DNN) scoring model for document retrieval according to an embodiment.
- FIG. 5 illustrates another exemplary process for document retrieval according to an embodiment.
- FIG. 6 (A) illustrates an exemplary inverted index before cursor movement according to an embodiment.
- FIG. 6 (B) illustrates another exemplary inverted index after cursor movement according to an embodiment.
- FIG. 7 illustrates a flowchart of an exemplary method for scoring documents in document retrieval according to an embodiment.
- FIG. 8 illustrates an exemplary apparatus for scoring documents in document retrieval according to an embodiment.
- FIG. 9 illustrates another exemplary apparatus for scoring documents in document retrieval according to an embodiment.
- The top-k documents may be returned according to some scoring function score (d, Q) , d ∈ D.
- The score score (d, Q) of a document with respect to a query indicates the relevance between the query and the document, and may be referred to as a relevance score.
- The inverted index is usually the primary choice for capturing original text representations.
- Relevance scores of documents are used to rank the corresponding documents so as to return the top-k documents to be presented to a user. With most current scoring functions, a re-ranking process may be needed to refine candidates for the best top-k documents.
- This may add additional cost by ranking the candidates in two stages, such as a top-k retrieval stage and a re-ranking stage.
- Moreover, potential candidates may be lost due to limitations of the scoring functions, since current scoring functions in document retrieval are unsupervised, e.g., based on statistical formulas or probabilistic models. These scoring functions cannot benefit from available labeled data sets, and only matched terms in a query are considered in document scoring.
- Embodiments of the present disclosure propose a method for scoring documents in document retrieval, for example, by leveraging a supervised machine learning (ML) scoring function, which considers matched terms and unmatched terms between a query and a document, the coexistence of matched and unmatched terms, and the positions of matched and unmatched terms in the query.
- The matched terms represent terms in a query that are contained in a document, and the unmatched terms represent terms in the query that are not contained in the document.
- Logistic regression (LR) and deep neural network (DNN) models may be used as examples of the machine learning scoring models.
- A “term” in a query may represent an index unit, which may comprise one or more of characters, words, phrases, numbers, and so on.
- The presented machine learning scoring function outperforms traditional scoring functions in both accuracy and efficiency in document retrieval and may be applied to different tasks, such as web search or recommendation systems.
- For example, the machine learning scoring function may be used in search recommendation, news recommendation, advertisement recommendation, product recommendation, question-answer applications, etc.
- FIG. 1 illustrates an exemplary architecture of a system 100 for document retrieval through a network according to an embodiment.
- A network 110 is applied for interconnecting a terminal device 120 and a server 130.
- the network 110 may be any type of networks capable of interconnecting network entities.
- the network 110 may be a single network or a combination of various networks.
- the network 110 may be a Local Area Network (LAN) , a Wide Area Network (WAN) , etc.
- the network 110 may be a wireline network, a wireless network, etc.
- the network 110 may be a circuit switching network, a packet switching network, etc.
- the terminal device 120 may be any type of electronic computing devices capable of connecting to the network 110, accessing servers or websites on the network 110, processing data or signals, etc.
- The terminal device 120 may be a desktop computer, laptop, tablet, smart phone, AI terminal, etc. Although only one terminal device is shown in FIG. 1, it should be appreciated that a different number of terminal devices may connect to the network 110.
- the terminal device 120 may comprise a query input component 121, a display interface 122, a memory 123, a processor 124, and at least one communication connection 125 for connecting to the network 110.
- the query input component 121 may be configured to receive a search query from a user who is attempting to obtain retrieved information, such as documents, via an input device, not shown in FIG. 1, in communication with the terminal device 120.
- the input device may include a mouse, joystick, key pad, microphone, I/O components or any other component capable of receiving a user input and communicating an indication of that input to the terminal device 120.
- a search query or a query as used herein may comprise one or more terms reflecting the user’s intent for attempting to obtain retrieved information or documents including at least one term in the query from a database 160, such as a website, cloud storage, a document corpus, and so on.
- the document retrieval may be performed by utilizing inverted indexes from the database 160.
- the database 160 may include a large number of documents, such as webpages, advertisements, news, and so on.
- Top-k retrieved documents may be returned from the server 130 and presented to the user through the display interface 122.
- the display interface 122 may be specific to the terminal device 120, or be incorporated in a third-party device, such as, a displayer external to the terminal device 120, a TV, a projector, etc.
- The memory 123 may be one or more devices for storing data, which may comprise read-only memory (ROM) , random access memory (RAM) , magnetic RAM, core memory, magnetic disk storage media, optical storage media, flash memory, and/or other machine readable media for storing information.
- The machine readable medium may comprise, but is not limited to, portable or fixed storage, optical storage, wireless channels, and various other media for storing, containing or carrying instructions and/or data.
- the processor 124 may include any number of processing units, and is programmed to execute computer-executable instructions for implementing aspects of the disclosure. In some examples, the processor 124 is programmed to execute computer-executable instructions stored in the memory 123 to implement method such as illustrated in accompanying FIG. 7.
- The communication connection 125 may comprise a wireless or wired connection for connecting the terminal device 120 to the network 110 or any other device.
- the server 130 may connect to the database 160. Although not shown in FIG. 1, the server 130 may incorporate the database 160.
- the server 130 may provide data or retrieved documents to the terminal device 120.
- the server 130 may be a web server providing data over a web.
- the server 130 may provide the data over the web to the terminal device 120 through the network 110.
- the server 130 may comprise a query receiving component 140 and a scoring model 150 including a feature weight index 151.
- the query receiving component 140 may receive the query inputted by the user from the terminal device 120, for example, through the network 110.
- the received query may be provided to the scoring model 150 for searching one or more documents matched with the query from the database 160. That is, the matched documents contain one or more terms in the query.
- A plurality of features associated with the query may be generated from the query, each having a weight in the feature weight index 151. Scores of the matched documents may be calculated or determined by the scoring model 150 based at least on the features generated from the query and the corresponding weights in the feature weight index 151.
- The feature weight index 151 is shown as included in the scoring model 150 in FIG. 1.
- Top-k documents may be selected from the matched documents based on the scores of the matched documents and be provided to the terminal device 120, for example through the network 110, to be presented to the user through the display interface 122.
- Although the query receiving component 140 and the scoring model 150 are shown as being incorporated in the server 130 in FIG. 1, they may be configured as independent components separate from the server 130.
- The architecture shown in FIG. 1 is exemplary, and any components in the terminal device 120 may be added, omitted or replaced depending on specific application requirements.
- FIG. 2 illustrates an exemplary process 200 for document retrieval according to an embodiment.
- a query Q may be received, for example, from a user.
- the query may be a search query and contain one or more terms.
- a query “home depot wood floor” may be received herein.
- the exemplary terms contained in the query may be “home” , “depot” , “wood” and “floor” .
- a database D may be searched to obtain documents containing at least one term of the one or more terms, for example, by utilizing inverted index structure.
- documents may be referred to as “matched documents” .
- documents containing at least one term of “home” , “depot” , “wood” and “floor” may be obtained.
- Each of the matched documents may be assigned a DocID in the inverted index structure, where a DocID is the ID of a document d ∈ D containing the at least one term, and may be shown as Doc1, Doc3, Doc7, etc. in the simple inverted index structure in FIG. 3, or simply as <1>, <3>, <7>....
- the inverted index structure used herein may comprise two main sections: dictionary and posting lists.
- The dictionary comprises various terms, each of which may be followed by items such as inverse document frequency (IDF) , term frequency (TF) , an upperbound, and the entry address of the related posting list, which are not shown in the present disclosure.
- Each posting list may contain a series of postings represented by DocIDs.
- Each term in the query is mapped to a posting list. For example, as for term 1 shown in FIG. 3, which is “home” in this example, a posting list for term 1 in the query may be ⁇ Doc3, Doc7, Doc9, Doc13... >.
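The dictionary and posting-list structure described above can be sketched as follows; the documents, terms, and DocIDs are illustrative:

```python
# A minimal inverted index: a dictionary mapping each term to a sorted
# posting list of DocIDs, mirroring the dictionary/posting-list layout
# described above. Document contents are illustrative.

from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping DocID -> iterable of terms in that document."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Posting lists are kept sorted so cursors can advance monotonically.
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    3: ["home", "floor"],
    7: ["home", "depot", "wood"],
    9: ["home"],
    13: ["home"],
}
index = build_inverted_index(docs)
```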
- each term in the query may be classified in accordance with whether the term is contained in the document.
- Each classified term is a matched term or an unmatched term. For example, as shown in FIG. 3, for document Doc3, which contains the terms “home” and “floor” , term 1 (home) and term 4 (floor) in the query “home depot wood floor” are classified as matched terms between the query and Doc3, and term 2 (depot) and term 3 (wood) are classified as unmatched terms.
- respective matching pattern between each matched document and the query may be obtained.
- The matching pattern may indicate an integration of each classified term in the query.
- The integration may identify each term in the query, the classification of each term, and the position of each term in the query. That is, the matching pattern may indicate the respective position and classification of each term in the query.
- In an implementation, the matching pattern is represented by a vector, and each dimension of the vector has a binary value, with 1 at the position of a matched term in the query and 0 at the position of an unmatched term.
- For example, the matching pattern between the query and a particular document may be represented as [1001] , where “1” means that the query-document pair has a matched term at that position, and “0” means that there is no matched term at that position.
- That is, the matching pattern indicates that the particular matched document contains the first and fourth terms in the query and does not contain the second and third terms.
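The binary matching-pattern vector can be computed directly from the query terms and a document's term set; this sketch reproduces the [1001] example for a hypothetical document containing “home” and “floor”:

```python
# Compute the binary matching-pattern vector for one (query, document)
# pair: 1 at positions whose query term appears in the document.

def matching_pattern(query_terms, doc_terms):
    doc_terms = set(doc_terms)
    return [1 if t in doc_terms else 0 for t in query_terms]

query = ["home", "depot", "wood", "floor"]
pattern = matching_pattern(query, {"home", "floor", "tile"})
```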
- respective relevance score of each matched document may be calculated with respect to the query, for example, through a machine learning scoring model based at least on the matching pattern obtained at 208.
- a relevance score of a matched document may be determined by calculating a score corresponding to a matching pattern between the query and the matched document.
- The calculated score corresponding to a matching pattern may be referred to as an “original score” below, for the sake of explanation.
- The machine learning scoring model may comprise a logistic regression (LR) model, a gradient boosted decision tree (GBDT) model, a deep neural network (DNN) model, and so on.
- The LR model and the DNN model may be taken as exemplary scoring models below to calculate the relevance score of a matched document. It should be appreciated that any other appropriate scoring model may be utilized to implement the present disclosure according to particular practical requirements.
- a matching pattern between the query Q and the document A may be obtained as [1001] .
- A plurality of related features of different types may be generated, each feature having a weight.
- Each weight may be trained offline and pre-assigned to the corresponding feature.
- The original score corresponding to the matching pattern may be calculated based on the weights of the plurality of related features, for example, as the sum of the weights of the related features.
- the weight of each feature may be trained or learned from a set of training data.
- The training data may comprise a historical query and a plurality of retrieved historical documents. Each historical document may be labeled with a relevance score calculated based on, e.g., a click count, or assigned by a human.
- the weights of features may be updated periodically or on demand, according to a set of updated training data.
- the updated training data may comprise at least one of a retrieved historical document with an updated or different score, or retrieved documents for a new query, etc.
- the different types may be predefined or predesigned.
- the following types may be predefined:
- Cross QueryMatchedLength: a combination of the number of terms and the number of matched terms in the query
- QueryMatchedChunk: a chunk of matched terms in the query, whether or not they are continuous
- QueryMatchedUnmatchedBigram: a bigram of any one matched term and any one unmatched term in the query, for example, two continuous terms with one matched and the other unmatched
- QueryUnmatchedBigram: a bigram of unmatched terms in the query, for example, two continuous unmatched terms
- QueryMatchedTerm Cross QueryMatchedPattern: a combination of any one matched term in the query and the corresponding matching pattern
- QueryUnmatchedTerm Cross QueryMatchedPattern: a combination of any one unmatched term in the query and the corresponding matching pattern.
- Table 1 shows the exemplary feature types and features generated from the query Q “home depot wood floor” and the matching pattern [1001] .
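A possible feature generator for a (query, matching pattern) pair is sketched below. Only the feature types follow the list above; the exact string encodings of the features are assumptions:

```python
# Generate sparse string features of several types from a query and its
# matching pattern. The feature-name encodings are hypothetical; only
# the feature *types* follow the patent's list.

def generate_features(terms, pattern):
    feats = []
    pat = "".join(map(str, pattern))
    n, m = len(terms), sum(pattern)
    # Number of terms crossed with number of matched terms.
    feats.append(f"QueryMatchedLength:{n}x{m}")
    matched = [t for t, p in zip(terms, pattern) if p]
    unmatched = [t for t, p in zip(terms, pattern) if not p]
    if matched:
        # Chunk of matched terms, continuous or not.
        feats.append("QueryMatchedChunk:" + "_".join(matched))
    # Bigrams over adjacent terms, keyed by their matched/unmatched mix.
    for (t1, p1), (t2, p2) in zip(zip(terms, pattern),
                                  zip(terms[1:], pattern[1:])):
        if p1 != p2:
            feats.append(f"QueryMatchedUnmatchedBigram:{t1}_{t2}")
        elif not p1:
            feats.append(f"QueryUnmatchedBigram:{t1}_{t2}")
    # Terms crossed with the whole matching pattern.
    for t in matched:
        feats.append(f"QueryMatchedTermXPattern:{t}@{pat}")
    for t in unmatched:
        feats.append(f"QueryUnmatchedTermXPattern:{t}@{pat}")
    return feats

feats = generate_features(["home", "depot", "wood", "floor"], [1, 0, 0, 1])
```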
- LR model and DNN model will be discussed below as examples of implementing the machine learning scoring function through using the above features.
- Logarithmic loss (LogLoss) is the loss function for these models, as indicated by Equation (1) : LogLoss = - (1/N) Σ_i [y_i log (p_i) + (1 - y_i) log (1 - p_i) ]     Equation (1) , where N is the number of training samples, y_i is the label of the i-th sample, and p_i is the predicted probability.
- In an implementation, an LR model for scoring is represented as the following score (d, Q) by Equation (2) : score (d, Q) = p (y = 1 | x) = 1 / (1 + exp (- (w^T x + b) ) )     Equation (2)
- where “y = 1” stands for “the document is relevant to the query” and p ( · ) returns a probability value;
- d represents the document;
- Q represents the query;
- w is a weight vector of the features for the query;
- x is a feature input vector;
- T represents transposition;
- b represents a bias.
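Equation (2) can be sketched as a sparse logistic-regression scorer, where the active feature names stand in for the binary vector x and w is a name-to-weight map; the feature names, weights, and bias are illustrative:

```python
# A minimal sparse logistic-regression scorer in the shape of
# Equation (2): score(d, Q) = sigmoid(w^T x + b), with x a binary
# feature vector represented by its set of active feature names.
# Weights and bias here are illustrative, not trained values.

import math

def lr_score(active_features, weights, bias=0.0):
    # w^T x for a binary x reduces to summing the weights of active features.
    z = bias + sum(weights.get(f, 0.0) for f in active_features)
    return 1.0 / (1.0 + math.exp(-z))

weights = {
    "QueryMatchedLength:4x2": 1.5,
    "QueryMatchedChunk:home_floor": 0.5,
}
score = lr_score(
    ["QueryMatchedLength:4x2", "QueryMatchedChunk:home_floor"],
    weights,
    bias=-1.0,
)
```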
- An exemplary DNN model for calculating a relevance score of a matched document at 210 will be described below in connection with FIG. 4.
- In an implementation, some features with low weights may be deleted from the feature list in consideration of model capacity.
- The relevance score of the matched document A may then be determined based on the original score, for example, as equal to the original score, a multiple of the original score, and so on. The relevance score of each matched document may be calculated based on the respective original score corresponding to its matching pattern.
- A plurality of matched documents with respective relevance scores, including the document A, may be returned or provided to a user in a sequence, as a result of document retrieval.
- the returned or provided plurality of matched documents may be the top-k matched documents.
- the returned or provided top-k documents may be ranked based on the calculated relevance scores of the matched documents.
- the method 200 may further comprise any other or alternative steps for document retrieval according to the embodiments of the present disclosure as mentioned above.
- FIG. 3 illustrates an exemplary inverted index 300 showing an exemplary matching pattern 310 for Document 3 according to an embodiment.
- the inverted index may show each term in a query and a plurality of documents matched with any term in the query.
- A simple inverted index is shown in FIG. 3. Each row represents one term in the query, and each column represents a document indexed by a DocID.
- the exemplary query comprises four terms, term 1 (home) , term 2 (depot) , term 3 (wood) , term 4 (floor) , and for each term in the query, there are several matched documents indicated by DocIDs in the inverted index. For example, for term 1, there are matched documents indicated by Doc3, Doc7, Doc9, Doc13 in the inverted index; for term 2, there are matched documents indicated by Doc1, Doc7, Doc10, Doc11, and so on. From the inverted index in FIG. 3, a matching pattern between the query and a document may be obtained.
- a matching pattern between the query and Doc1 is [0101]
- a matching pattern between the query and Doc3 is [1001] which is exemplarily highlighted by a dashed box 310 in FIG. 3
- a matching pattern between the query and Doc7 is [1110] , and so on.
- the structure of the inverted index 300 shown in FIG. 3 is just exemplary; any other structure of the inverted index for document retrieval may be possible for the present disclosure.
- FIG. 4 illustrates an exemplary Deep Neural Network (DNN) scoring model 400 for document retrieval according to an embodiment.
- features of different feature types are inputted into the DNN model 400 in a form of sparse input, such as, features of feature type i, features of feature type j, etc.
- the DNN model 400 applies a weight vector per feature.
- A weight vector for the a-th feature of feature type i has k values.
- a and b are numbers of features of feature type i and j respectively.
- Each vector in the embedding layer is passed through a pooling layer to obtain a weight vector for all the features of each feature type.
- a score of a document is generated and outputted through an output layer.
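A pure-Python sketch of this architecture: per-feature embedding vectors, pooling per feature type, and a logistic output layer. The dimensions, features, and weights are illustrative, and sum pooling is an assumption, since the disclosure does not fix the pooling operator:

```python
# Sketch of the DNN scorer: each sparse feature has a k-dimensional
# embedding; embeddings are sum-pooled per feature type; the pooled
# vectors are concatenated and fed to a logistic output layer.
# All embeddings and output weights are illustrative.

import math

def dnn_score(features_by_type, embeddings, out_weights, bias=0.0):
    pooled = []
    for ftype in sorted(features_by_type):
        vecs = [embeddings[f] for f in features_by_type[ftype]]
        k = len(vecs[0])
        # Sum pooling over all features of this feature type.
        pooled.extend(sum(v[i] for v in vecs) for i in range(k))
    # Output layer: logistic unit over the concatenated pooled vectors.
    z = bias + sum(w * x for w, x in zip(out_weights, pooled))
    return 1.0 / (1.0 + math.exp(-z))

embeddings = {
    "fA": [0.1, 0.2], "fB": [0.3, -0.1],  # features of feature type i
    "fC": [0.5, 0.0],                     # feature of feature type j
}
score = dnn_score(
    {"i": ["fA", "fB"], "j": ["fC"]},
    embeddings,
    out_weights=[1.0, 1.0, 1.0, 1.0],
)
```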
- FIG. 5 illustrates another exemplary process 500 for document retrieval according to an embodiment.
- The exemplary process 500 is an optimization of the process 200, which considers a plurality of potential matching patterns for a given query and pre-calculates a plurality of scores for those potential matching patterns. The pre-calculation may be performed before, at the same time as, or after the operations of searching a database to obtain matched documents, classifying each term in the query, and/or obtaining matching patterns between the query and the matched documents.
- the plurality of potential matching patterns for a given query may comprise all possible or potential matching patterns for the given query.
- the plurality of potential matching patterns for a given query may comprise a number of potential matching patterns less than all potential matching patterns.
- a query Q may be received, for example, from a user.
- the query may be a search query and comprise one or more terms.
- a query “home depot wood floor” may be received herein.
- the exemplary terms contained in the query may be “home” , “depot” , “wood” and “floor” .
- A plurality of potential matching patterns of the query Q may be determined. For example, if there are n terms in the query Q, each term can potentially be matched or unmatched with a document indicated by a DocID in the inverted index. Therefore, up to 2^n potential matching patterns of the query Q, which may be all possible or potential matching patterns, may be determined. For example, given a query including three terms, this query may have 2^3 = 8 potential matching patterns, including [000] , [001] , [010] , [100] , [011] , [101] , [110] , [111] .
- Since the all-zero matching pattern, such as [000] , indicates that there is no document matched with the query,
- such a matching pattern may not be considered in scoring documents for document retrieval.
- Thus, the number of potential matching patterns whose scores are to be calculated may be less than all potential matching patterns of a given query, for example, 2^n - 1 potential matching patterns excluding the all-zero pattern.
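Enumerating and pre-scoring the 2^n - 1 potential matching patterns can be sketched as follows; the placeholder scoring function (counting matched positions) stands in for the feature-based original score:

```python
# Enumerate all 2^n - 1 potential matching patterns of an n-term query
# (the all-zero pattern is excluded) and precompute a score per pattern.
# The scoring function here is a placeholder for the feature-based scorer.

from itertools import product

def precompute_pattern_scores(n, score_fn):
    table = {}
    for bits in product((0, 1), repeat=n):
        if any(bits):  # skip the all-zero pattern
            table[bits] = score_fn(bits)
    return table

# Placeholder scorer: score a pattern by its number of matched positions.
table = precompute_pattern_scores(3, score_fn=sum)
```

At retrieval time, each matched document's relevance score is then a lookup of its matching pattern in `table`.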
- an original score or an upperbound score may be obtained.
- An original score corresponding to each potential matching pattern may be obtained through calculation based on the weights of a plurality of features associated with the query, similar to the original score calculation described above. Therefore, the detailed description of the original score calculation for each potential matching pattern is omitted herein for simplicity.
- The upperbound score corresponding to a matching pattern may be the highest score among the scores of the sub-patterns of the matching pattern, i.e., the max score among multiple scores.
- The sub-patterns of a particular matching pattern comprise the set of all matching patterns whose matched terms form a non-empty subset of the matched terms of the particular matching pattern.
- a matching pattern is [101]
- its sub-patterns may be [100] , [001] , [101]
- a matching pattern is [010]
- its sub-pattern may be [010]
- a matching pattern is [111]
- its sub-patterns may be [001] , [010] , [011] , [100] , [101] , [110] , [111] .
- The upperbound score for a matching pattern may be utilized to skip less relevant documents among a plurality of matched documents, to improve retrieval performance. The obtaining of the upperbound score corresponding to a matching pattern will be described below.
- the original score for each potential matching pattern may be calculated, and an upperbound score for a particular potential matching pattern may be derived or obtained according to the original scores of sub-patterns of the particular potential matching pattern.
- An upperbound score for each potential matching pattern may be obtained as follows:
- Table 2 lists a plurality of potential matching patterns for an exemplary query including 3 terms and calculated original scores and upperbound scores for each potential matching pattern.
- the upperbound score of pattern [101] may be equal to a highest original score “6” among the scores of its sub-patterns.
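The sub-pattern enumeration and upperbound computation can be sketched with bitmask patterns; the original scores below are illustrative and chosen so that the upperbound of (1, 0, 1) is 6, as in the example above:

```python
# Upperbound score of a matching pattern: the highest original score
# among its sub-patterns, i.e., all patterns keeping a non-empty subset
# of its matched positions. Original scores are illustrative.

def sub_patterns(pattern):
    positions = [i for i, b in enumerate(pattern) if b]
    n = len(positions)
    # Every non-empty subset of the matched positions is a sub-pattern.
    for mask in range(1, 1 << n):
        sub = [0] * len(pattern)
        for j, pos in enumerate(positions):
            if mask & (1 << j):
                sub[pos] = 1
        yield tuple(sub)

def upperbound(pattern, original_scores):
    return max(original_scores[s] for s in sub_patterns(pattern))

original = {(1, 0, 0): 4, (0, 0, 1): 2, (1, 0, 1): 6, (0, 1, 0): 3}
ub = upperbound((1, 0, 1), original)
```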
- Since documents without any matched term may not be retrieved or are less useful in document retrieval, there is no need to calculate their scores.
- the obtained original score or upperbound score for each potential matching pattern of the query may be fed to operation 514 to determine relevance scores of matched documents.
- a database may be searched to obtain matched documents, which is similar to operation 204 in FIG. 2. Therefore, the detailed description of the operation 508 may be omitted herein for simplicity.
- each term in the query may be classified in accordance with whether the term is contained in the document, which is similar to operation 206 in FIG. 2. Therefore, the detailed description of the operation 510 may be omitted herein for simplicity.
- respective matching pattern between each matched document and the query may be obtained, which is similar to operation 208 in FIG. 2. Therefore, the detailed description of the operation 512 may be omitted herein for simplicity.
- respective relevance score of each matched document may be determined with respect to the query, based on the original score or upperbound score corresponding to each potential matching pattern obtained at 506 and the matching pattern between the document and the query obtained at 512.
- a score corresponding to a potential matching pattern identical to the matching pattern of the matched document obtained at 512 may be selected from the plurality of original scores or upperbound scores obtained at 506, and the relevance score of the matched document may be determined according to the selected score.
- a plurality of matched documents with respective relevance score such as top-k documents, may be returned or provided to a user in a sequence, as a result of document retrieval, which is similar to step 212 in FIG. 2. Therefore, the detailed description of the step 516 may be omitted herein for simplicity.
- FIG. 6 (A) illustrates an exemplary inverted index 610 before cursor movement according to an embodiment
- FIG. 6 (B) illustrates another exemplary inverted index 620 after cursor movement by using the upperbound score according to an embodiment.
- the usage of upperbound score may bring retrieval optimization in the inverted index.
- a threshold may be used to determine a “pivot” document.
- the pivot document may represent a document which has a score above a threshold.
- In an implementation, a pivot document in a plurality of matched documents associated with the query may be determined by ranking the posting lists based on the sequence of DocIDs to which the cursors currently point. For example, in FIG. 6 (A) :
- cursor 1 for term 1 in the query currently points to <4> ;
- cursor 2 for term 2 in the query currently points to <3> ;
- cursor 3 for term 3 currently points to <7> .
- ranked posting lists may be shown below as in Table 3.
- the pivot document may be determined below step by step:
- The cursors of Term1 and Term2 may each be moved to the first DocID whose number is greater than or equal to <7> , to fully evaluate the score of that document.
- FIG. 6 (B) shows the status after cursors moving, wherein cursor1 points to ⁇ 10> and cursor2 points to ⁇ 7>.
- The matching pattern for document <7> is “011” , and the score of document <7> based on the matching pattern “011” is 7, which is less than the threshold 8. So document <7> may be skipped, and cursor2 and cursor3 may be moved to their respective next DocIDs.
- The posting lists may be ranked based on the current DocIDs again to find the next pivot DocID, until all the posting lists have moved to the end or the next pivot cannot be found because each of the remaining upperbound scores is less than or equal to the threshold θ.
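The core skip decision described above can be condensed as follows: a candidate document is fully evaluated only if the upperbound score of its matching pattern exceeds the current threshold. This sketch omits the cursor bookkeeping, and all data are illustrative:

```python
# Skip decision from the pivot-based traversal: a candidate is fully
# scored only when the upperbound of its matching pattern exceeds the
# threshold; otherwise it is skipped and the cursors would advance.
# Patterns, scores, and the threshold are illustrative.

def evaluate_candidates(candidates, patterns, upperbounds, full_scores,
                        threshold):
    """candidates: DocIDs in cursor order; patterns: DocID -> pattern."""
    results = {}
    for doc in candidates:
        pat = patterns[doc]
        if upperbounds[pat] <= threshold:
            continue  # skip: this document cannot beat the threshold
        score = full_scores[pat]
        if score > threshold:
            results[doc] = score
    return results

upperbounds = {(0, 1, 1): 7, (1, 1, 1): 9}
full_scores = {(0, 1, 1): 7, (1, 1, 1): 9}
hits = evaluate_candidates(
    [7, 12], {7: (0, 1, 1), 12: (1, 1, 1)},
    upperbounds, full_scores, threshold=8,
)
```

As in the FIG. 6 example, document 7 with pattern “011” and upperbound 7 is skipped against threshold 8, while a document whose pattern upperbound exceeds the threshold is fully evaluated.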
- Algorithm 1 shows pseudo code on how to calculate the upperbound score of a matching pattern, as described above, in a recursive way. Here, postingListSet is a pattern with each element initialized as “true” or “false” , PatternLen (postingListSet) returns the number of “true” values in postingListSet, and Score ( · ) denotes the original score of a pattern:
- CalculateUpperBound (postingListSet) :
-     upperBound = Score (postingListSet) ;
-     if PatternLen (postingListSet) > 1 :
-         for each position i where postingListSet (i) is “true” :
-             postingListSet . set (i, false) ;
-             subUpperBound = CalculateUpperBound (postingListSet) ;
-             if subUpperBound > upperBound : upperBound = subUpperBound ;
-             postingListSet . set (i, true) ;
-     return upperBound
- FIG. 7 illustrates a flowchart of an exemplary method 700 for scoring documents in document retrieval according to an embodiment.
- a query is received, in which the query comprises one or more terms.
- a database may be searched to obtain a document containing at least one term of the one or more terms.
- each term of the one or more terms is classified in accordance with whether the term is contained in the document.
- Each classified term may be a matched term or an unmatched term.
- A matching pattern between the query and the document is obtained, and the matching pattern may indicate an integration of each classified term.
- a relevance score of the document with respect to the query is determined based at least on the matching pattern.
- The integration identifies each term in the query, the classification of each term, and the position of each term in the query.
- the method 700 may further comprise: determining a plurality of potential matching patterns of the query; and obtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
- the relevance score is determined by selecting, from the plurality of scores, a score corresponding to a potential matching pattern identical to the matching pattern.
- calculating the score corresponding to each potential matching pattern comprises: generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and calculating the score corresponding to the potential matching pattern based on weights of the plurality of features.
- calculating the score corresponding to each potential matching pattern comprises: calculating at least one original score corresponding to at least one sub-pattern of the potential matching pattern; and obtaining the score corresponding to the potential matching pattern by selecting a highest original score among the at least one original score.
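The precomputation and sub-pattern maximum described above can be sketched as follows; `original_score` is an illustrative stand-in for the feature-weight scoring, and all names are assumptions.

```python
from itertools import product

def precompute_pattern_scores(n_terms, original_score):
    """For every potential matching pattern of an n-term query, store its score:
    the highest original score among the pattern and all of its sub-patterns."""
    scores = {}
    # Visit patterns in increasing number of matched terms, so each sub-pattern's
    # final score is already available when a pattern is processed.
    for pattern in sorted(product((0, 1), repeat=n_terms), key=sum):
        best = original_score(pattern)
        for i, bit in enumerate(pattern):
            if bit:  # immediate sub-pattern: one matched position turned off
                sub = pattern[:i] + (0,) + pattern[i + 1:]
                best = max(best, scores[sub])
        scores[pattern] = best
    return scores

# Illustrative original score: number of matched terms divided by ten.
table = precompute_pattern_scores(4, lambda p: sum(p) / 10)
print(table[(1, 0, 0, 1)])
```

Because patterns are processed bottom-up, taking the maximum over only the immediate sub-patterns covers all deeper sub-patterns transitively; at query time the relevance score is a single dictionary lookup by matching pattern.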
- the relevance score is determined by calculating a score corresponding to the matching pattern.
- calculating the score corresponding to the matching pattern comprises: generating a plurality of features of different types for the matching pattern based on the matching pattern and the query, wherein each feature has a weight; and calculating the score corresponding to the matching pattern based on weights of the plurality of features.
- the different types comprise at least one of: matched terms in the query, unmatched terms in the query, a combination of any one matched term and any one unmatched term in the query, a combination of any one matched term and any other matched term in the query, a combination of the number of terms and the number of matched terms in the query, continuous matched terms in the query, a chunk of matched terms in the query, a bigram of any one matched term and any one unmatched term in the query, a bigram of unmatched terms in the query, a combination of any one matched term in the query and a corresponding matching pattern, and a combination of any one unmatched term in the query and a corresponding matching pattern.
- the weight of each feature is trained based on a set of training data.
- Each training data may comprise a historical query and a plurality of retrieved historical documents, and each historical document may be labeled with a relevance value.
- the matching pattern is represented by a vector, and each dimension of the vector has a value of zero or one.
- the method 700 may further comprise any steps/operations for scoring documents in document retrieval according to the embodiments of the present disclosure as mentioned above.
- FIG. 8 illustrates an exemplary apparatus 800 for scoring documents in document retrieval according to an embodiment.
- the apparatus 800 may comprise: a query receiving module 810, for receiving a query, the query comprising one or more terms; a searching module 820, for searching in a database to obtain a document containing at least one term of the one or more terms; a classifying module 830, for classifying each term of the one or more terms in accordance with whether the term is contained in the document, wherein the each classified term is a matched term or an unmatched term; a matching pattern obtaining module 840, for obtaining a matching pattern between the query and the document, the matching pattern indicating an integration of the each classified term; and a score determining module 850, for determining a relevance score of the document with respect to the query based at least on the matching pattern.
- the integration identifies each term in the query, the classification of each term and the position information of each term in the query.
- the matching pattern obtaining module 840 is further for determining a plurality of potential matching patterns of the query
- the score determining module 850 is further for obtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
- the relevance score is determined by selecting, from the plurality of scores, a score corresponding to a potential matching pattern identical to the matching pattern.
- the score determining module 850 is further for: generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and calculating the score corresponding to the potential matching pattern based on weights of the plurality of features.
- the different types comprise at least one of: matched terms in the query, unmatched terms in the query, a combination of any one matched term and any one unmatched term in the query, a combination of any one matched term and any other matched term in the query, a combination of the number of terms and the number of matched terms in the query, continuous matched terms in the query, a chunk of matched terms in the query, a bigram of any one matched term and any one unmatched term in the query, a bigram of unmatched terms in the query, a combination of any one matched term in the query and a corresponding matching pattern, and a combination of any one unmatched term in the query and a corresponding matching pattern.
- the score determining module 850 is further for: calculating at least one original score corresponding to at least one sub-pattern of the potential matching pattern; and obtaining the score corresponding to the potential matching pattern by selecting a highest original score among the at least one original score.
- the relevance score is determined by calculating a score corresponding to the matching pattern.
- apparatus 800 may also comprise any other modules configured for scoring documents in document retrieval according to the embodiments of the present disclosure as mentioned above.
- FIG. 9 illustrates another exemplary apparatus 900 for scoring documents in document retrieval according to an embodiment.
- the apparatus 900 may comprise at least one processor 910.
- the apparatus 900 may further comprise a memory 920 that is connected with the processor 910.
- the memory 920 may store computer-executable instructions that, when executed, cause the processor 910 to perform any operations of the methods for scoring documents in document retrieval according to the embodiments of the present disclosure as mentioned above.
- the embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium.
- the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for scoring documents in document retrieval according to the embodiments of the present disclosure as mentioned above.
- modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
- processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
- a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
- a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip) , an optical disk, a smart card, a flash memory device, random access memory (RAM) , read only memory (ROM) , programmable ROM (PROM) , erasable PROM (EPROM) , electrically erasable PROM (EEPROM) , a register, or a removable disk.
Abstract
A method and an apparatus are provided for scoring documents in document retrieval. A query may be received, in which the query comprises one or more terms. A database may be searched to obtain a document containing at least one term of the one or more terms. Each term of the one or more terms may be classified in accordance with whether the term is contained in the document, each classified term being a matched term or an unmatched term. A matching pattern between the query and the document may be obtained, the matching pattern indicating an integration of each classified term. A relevance score of the document with respect to the query may be determined based at least on the matching pattern.
Description
Information retrieval, such as document retrieval, deals with the representation, storage and organization of information items, and access to the information items. In document retrieval, a plurality of documents may be obtained for a given query through searching. The inverted index has been widely used in document searching, especially for large document sets such as web search, sponsored search and e-commerce search. When a plurality of documents is obtained, a scoring process may be applied to the obtained documents to determine a respective score for each document. Then the top-k documents for the given query may be selected and presented from the plurality of searched documents based on their scores, as determined by certain scoring methods.
SUMMARY
This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure propose a method and an apparatus for scoring documents in document retrieval. In some implementations, a query may be received, in which the query comprises one or more terms. A database may be searched to obtain a document containing at least one term of the one or more terms. Each term of the one or more terms may be classified in accordance with whether the term is contained in the document, each classified term being a matched term or an unmatched term. A matching pattern between the query and the document may be obtained, the matching pattern indicating an integration of each classified term. A relevance score of the document with respect to the query may be determined based at least on the matching pattern.
It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
FIG. 1 illustrates an exemplary architecture of a system for document retrieval through a network according to an embodiment.
FIG. 2 illustrates an exemplary process for document retrieval according to an embodiment.
FIG. 3 illustrates an exemplary inverted index showing an exemplary matching pattern between a document and a query according to an embodiment.
FIG. 4 illustrates an exemplary Deep Neural Network (DNN) scoring model for document retrieval according to an embodiment.
FIG. 5 illustrates another exemplary process for document retrieval according to an embodiment.
FIG. 6 (A) illustrates an exemplary inverted index before cursors moving according to an embodiment.
FIG. 6 (B) illustrates another exemplary inverted index after cursors moving according to an embodiment.
FIG. 7 illustrates a flowchart of an exemplary method for scoring documents in document retrieval according to an embodiment.
FIG. 8 illustrates an exemplary apparatus for scoring documents in document retrieval according to an embodiment.
FIG. 9 illustrates another exemplary apparatus for scoring documents in document retrieval according to an embodiment.
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
In information retrieval (IR) , such as document retrieval, given a query Q and a document database D, the top-k documents may be returned according to some scoring function, score (d, Q) , d∈D. Here, a score of a document with respect to a query indicates the relevance between the query and the document, and it may be referred to as a relevance score. For top-k document retrieval in existing systems, in particular for text retrieval tasks, an inverted index is usually the primary choice for capturing original text representations. In document retrieval, relevance scores of documents are used to rank the corresponding documents so as to return the top-k documents to be presented to a user. For most current scoring functions, a re-ranking process may be needed to refine candidates for the best top-k documents. This adds the cost of ranking the candidates in two stages, i.e., a top-k retrieval stage and a re-ranking stage. Moreover, in the top-k retrieval stage, potential candidates may be lost due to limitations of the scoring functions, since current scoring functions in document retrieval are unsupervised, e.g., based on statistical formulas or probabilistic models. These scoring functions cannot benefit from available labeled data sets, and only matched terms in a query are considered in document scoring.
Embodiments of the present disclosure propose a method for scoring documents in document retrieval, for example, by leveraging a supervised machine learning (ML) scoring function, which considers matched terms and unmatched terms between a query and a document, the coexistence of matched and unmatched terms, and the positions of matched and unmatched terms in the query. Herein, matched terms represent terms in a query that are contained in a document, and unmatched terms represent terms in the query that are not contained in the document. In some implementations, logistic regression (LR) and deep neural network (DNN) models may be used as examples of machine learning scoring models. Herein, a “term” in a query may represent an index unit, which may comprise one or more of characters, words, phrases, numbers, and so on. The presented machine learning scoring function outperforms traditional scoring functions on both accuracy and efficiency in document retrieval and may be applied to different tasks, such as web search or recommendation systems. For example, the machine learning scoring function may be used in search recommendation, news recommendation, advertisement recommendation, product recommendation, question-answer applications, etc.
FIG. 1 illustrates an exemplary architecture of a system 100 for document retrieval through a network according to an embodiment.
In FIG. 1, a network 110 is applied for interconnecting among a terminal device 120 and a server 130.
The network 110 may be any type of network capable of interconnecting network entities. The network 110 may be a single network or a combination of various networks. In terms of coverage range, the network 110 may be a Local Area Network (LAN) , a Wide Area Network (WAN) , etc. In terms of carrying medium, the network 110 may be a wireline network, a wireless network, etc. In terms of data switching techniques, the network 110 may be a circuit switching network, a packet switching network, etc.
The terminal device 120 may be any type of electronic computing device capable of connecting to the network 110, accessing servers or websites on the network 110, processing data or signals, etc. For example, the terminal device 120 may be a desktop computer, a laptop, a tablet, a smart phone, an AI terminal, etc. Although only one terminal device is shown in FIG. 1, it should be appreciated that a different number of terminal devices may connect to the network 110.
In an implementation, the terminal device 120 may comprise a query input component 121, a display interface 122, a memory 123, a processor 124, and at least one communication connection 125 for connecting to the network 110.
The query input component 121 may be configured to receive a search query from a user who is attempting to obtain retrieved information, such as documents, via an input device, not shown in FIG. 1, in communication with the terminal device 120. For example, the input device may include a mouse, joystick, key pad, microphone, I/O components or any other component capable of receiving a user input and communicating an indication of that input to the terminal device 120. A search query or a query as used herein, may comprise one or more terms reflecting the user’s intent for attempting to obtain retrieved information or documents including at least one term in the query from a database 160, such as a website, cloud storage, a document corpus, and so on. In some examples, the document retrieval may be performed by utilizing inverted indexes from the database 160. The database 160 may include a large number of documents, such as webpages, advertisements, news, and so on.
Top-k retrieved documents may be returned from the server 130 and presented to the user through the display interface 122. The display interface 122 may be specific to the terminal device 120, or be incorporated in a third-party device, such as, a displayer external to the terminal device 120, a TV, a projector, etc.
The memory 123 may be one or more devices for storing data, which may comprise read-only memory (ROM) , random access memory (RAM) , magnetic RAM, core memory, magnetic disk storage medium, optical storage medium, flash memory, and/or other machine readable medium for storing information. The term “machine readable medium” may comprise, but is not limited to, portable or fixed storage, optical storage, wireless channels and various other media for storing, containing or carrying instructions and/or data.
The processor 124 may include any number of processing units, and is programmed to execute computer-executable instructions for implementing aspects of the disclosure. In some examples, the processor 124 is programmed to execute computer-executable instructions stored in the memory 123 to implement method such as illustrated in accompanying FIG. 7.
The communication connection 125 may comprise a wireless or wired connection for connecting the terminal device 120 to the network 110 or any other device.
The server 130 may connect to the database 160. Although not shown in FIG. 1, the server 130 may incorporate the database 160. The server 130 may provide data or retrieved documents to the terminal device 120. As one example, the server 130 may be a web server providing data over a web. The server 130 may provide the data over the web to the terminal device 120 through the network 110.
The server 130 may comprise a query receiving component 140 and a scoring model 150 including a feature weight index 151. The query receiving component 140 may receive the query inputted by the user from the terminal device 120, for example, through the network 110. The received query may be provided to the scoring model 150 for searching one or more documents matched with the query from the database 160. That is, the matched documents contain one or more terms in the query. A plurality of features associated with the query may be generated from the query, each having a weight in the feature weight index 151. Scores of the matched documents may be calculated or determined by the scoring model 150 based at least on the features generated from the query and the corresponding weights in the feature weight index 151. Although the feature weight index 151 is shown as included in the scoring model 150 in FIG. 1, it may be stored in a storage separate from the scoring model 150 and/or the server 130, and the respective weight of each feature may be provided to the scoring model 150, if applicable. Top-k documents may be selected from the matched documents based on the scores of the matched documents and be provided to the terminal device 120, for example through the network 110, to be presented to the user through the display interface 122.
Although the query receiving component 140 and the scoring model 150 are shown as being incorporated in the server 130 in FIG. 1, they may be configured as independent components separated from the server 130.
It should be understood that, all components shown in FIG. 1 may be exemplary, and any components in the terminal device 120 may be added, omitted or replaced in FIG. 1 depending on specific application requirements.
FIG. 2 illustrates an exemplary process 200 for document retrieval according to an embodiment.
At 202, a query Q may be received, for example, from a user. The query may be a search query and contain one or more terms. As an example, a query “home depot wood floor” may be received herein. The exemplary terms contained in the query may be “home” , “depot” , “wood” and “floor” .
At 204, a database D may be searched to obtain documents containing at least one term of the one or more terms, for example, by utilizing an inverted index structure. Herein, such documents may be referred to as “matched documents” . In the example shown herein, documents containing at least one term of “home” , “depot” , “wood” and “floor” may be obtained. Each of the matched documents may be assigned a DocID in the inverted index structure, where DocID is the ID of a document d∈D containing the at least one term and may be shown as Doc1, Doc3, Doc7, etc., as shown below in a simple inverted index structure in FIG. 3, or simply shown as <1>, <3>, <7>.... The inverted index structure used herein may comprise two main sections: a dictionary and posting lists. The dictionary comprises various terms, each of which may be followed by some possible items, such as inverse document frequency (IDF) , term frequency (TF) , upperbound and entry address of related posting lists, which are not shown in the present disclosure. These items, such as IDF, TF, etc., are well known, and detailed descriptions of them are omitted herein for simplicity. Each posting list may contain a series of postings represented by DocIDs. Each term in the query is mapped to a posting list. For example, for term 1 shown in FIG. 3, which is “home” in this example, the posting list for term 1 in the query may be <Doc3, Doc7, Doc9, Doc13...>.
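As a toy illustration of the dictionary-and-posting-lists structure, the following sketch builds an inverted index from a handful of documents; the documents and their contents are invented for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a toy inverted index: the dictionary maps each term to a posting
    list of DocIDs (in ascending order) for the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)

# Document contents are invented for the example.
docs = {1: "depot floor", 3: "home floor", 7: "home depot wood", 9: "home"}
index = build_inverted_index(docs)
print(index["home"])  # posting list for the term "home"
```

A production index would additionally store the per-term items mentioned above (IDF, TF, upperbound, posting-list entry address) alongside each dictionary entry.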
At 206, each term in the query may be classified in accordance with whether the term is contained in the document. The each classified term is a matched term or an unmatched term. For example, as shown in FIG. 3, for document Doc3 which contains terms “home” and “floor” , term 1 (home) , term 4 (floor) in a query “home depot wood floor” are classified as matched terms between the query and Doc3, and term 2 (depot) and term 3 (wood) are classified as unmatched terms.
At 208, respective matching pattern between each matched document and the query may be obtained. The matching pattern may indicate an integration of the each classified term in the query. Herein the integration may identify each term in the query, classification of the each term and position of the each term in the query. That is, the matching pattern may indicate respective position of each term in the query and classification of the each term. Herein the matching pattern is represented by a vector, and each dimension of the vector has a binary value, with 1 on a position of matched term in a query and 0 on a position of unmatched term in a query. If there are 4 terms in a query and for a particular document, there is only the first term matched, then the exemplary matching pattern between the query and the particular document may be represented as [1000] , wherein “1” means that the pair of query and document has a matched term on this position, and “0” means that there is no matched term on these positions. Taking a query “home depot wood floor” as an example, if a matching pattern between the query and a particular matched document is represented as [1001] , the matching pattern indicates that the particular matched document contains the first and fourth terms in the query and does not contain the second and third terms. An exemplary matching pattern determination will be illustrated in connection with FIG. 3 described below.
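The matching-pattern derivation above can be sketched as follows; this is a minimal sketch, and the function name is illustrative.

```python
def matching_pattern(query_terms, doc_terms):
    """Classify each query term as matched (1) or unmatched (0), keeping the
    term's position in the query, and return the pattern as a 0/1 vector."""
    doc_terms = set(doc_terms)
    return [1 if term in doc_terms else 0 for term in query_terms]

query = ["home", "depot", "wood", "floor"]
print(matching_pattern(query, {"home", "floor"}))  # -> [1, 0, 0, 1]
```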
At 210, a respective relevance score of each matched document may be calculated with respect to the query, for example, through a machine learning scoring model, based at least on the matching pattern obtained at 208. For example, a relevance score of a matched document may be determined by calculating a score corresponding to the matching pattern between the query and the matched document. Herein, the calculated score corresponding to a matching pattern may be referred to as an “original score” below, for the sake of explanation. The machine learning scoring model may comprise a logistic regression (LR) model, a gradient boosted decision tree (GBDT) model, a deep neural network (DNN) model, and so on. Herein, the LR model and the DNN model are taken as exemplary scoring models below to calculate the relevance score of a matched document. It should be appreciated that any other appropriate scoring model may be utilized to implement the present disclosure according to particular practice requirements.
Taking the query Q “home depot wood floor” and a particular document A containing terms of “home” and “floor” as an example, the calculation for an original score corresponding to the matching pattern will be described below.
Original score calculation
From the above query Q and the particular document A, a matching pattern between the query Q and the document A may be obtained as [1001] . Based at least on the query Q, in particular on the matched terms and unmatched terms in the query, and the matching pattern [1001] , a plurality of related features of different types may be generated, each feature having a weight. In some examples, the respective weight may be trained offline and pre-assigned to each feature. The original score corresponding to the matching pattern may be calculated based on the weights of the plurality of related features, for example, as a sum of the weights of the related features.
The weight of each feature may be trained or learned from a set of training data. The training data may comprise a historical query and a plurality of retrieved historical documents. Each historical document may be labeled with a relevance score calculated based on e.g., a click number, etc., or assigned by a human. The weights of features may be updated periodically or on demand, according to a set of updated training data. For example, the updated training data may comprise at least one of a retrieved historical document with an updated or different score, or retrieved documents for a new query, etc.
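As a hedged sketch of how such weights might be trained from labeled data, the following is a minimal stdlib-only logistic-regression trainer; the training samples, labels and hyperparameters are illustrative assumptions, not the disclosure's actual training setup.

```python
import math

def train_weights(samples, n_features, epochs=200, lr=0.1):
    """Train one weight per feature (plus a bias) with logistic regression via
    gradient descent, from (feature_vector, label) pairs with 0/1 labels."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                    # gradient of LogLoss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Illustrative training data: feature vectors for (historical query, document)
# pairs, each labeled 1 (relevant) or 0 (not relevant).
samples = [([1, 0, 1], 1), ([0, 1, 0], 0), ([1, 1, 1], 1), ([0, 0, 1], 0)]
w, b = train_weights(samples, 3)
print(w[0] > w[1])  # feature 0 fires only in the relevant pairs
```

In practice such training would run offline over large labeled sets, and the learned weights would be stored in the feature weight index for lookup at scoring time.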
In one implementation, the different types may be predefined or predesigned. For example, the following types may be predefined:
· QueryMatchedTerm: matched terms in the query,
· QueryUnmatchedTerm: unmatched terms in the query,
- · QueryMatchedTerm Cross QueryUnmatchedTerm: a combination of any one matched term and any one unmatched term in the query,
- · QueryMatchedTerm Cross QueryMatchedOtherTerm: a combination of any one matched term and any other matched term in the query,
- · QueryLength Cross QueryMatchedLength: a combination of the number of terms and the number of matched terms in the query,
- · QueryContinuousMatchedTerms: continuous matched terms in the query,
- · QueryMatchedChunk: a chunk of matched terms in the query, whether or not they are continuous,
- · QueryMatchedUnmatchedBigram: a bigram of any one matched term and any one unmatched term in the query, for example, two continuous terms with one matched and the other unmatched,
- · QueryUnmatchedBigram: a bigram of unmatched terms in the query, for example, two continuous unmatched terms,
- · QueryMatchedTerm Cross QueryMatchedPattern: a combination of any one matched term in the query and a corresponding matching pattern,
- · QueryUnmatchedTerm Cross QueryMatchedPattern: a combination of any one unmatched term in the query and a corresponding matching pattern.
It should be understood that, although several feature types are listed above, there may be any other appropriate feature types predefined or predesigned according to particular implementation requirements.
Table 1 shows the exemplary feature types and features generated from the query Q “home depot wood floor” and the matching pattern [1001] .
Table 1
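A sketch of feature generation and weight-sum scoring for the query Q and the matching pattern [1001] follows; the "Type:value" string encoding and the weight values are illustrative assumptions.

```python
def generate_features(query_terms, pattern):
    """Generate a few of the feature types listed above for a (query, matching
    pattern) pair; each feature is encoded as a "Type:value" string."""
    matched = [t for t, m in zip(query_terms, pattern) if m]
    unmatched = [t for t, m in zip(query_terms, pattern) if not m]
    features = [f"QueryMatchedTerm:{t}" for t in matched]
    features += [f"QueryUnmatchedTerm:{t}" for t in unmatched]
    features += [f"QueryMatchedTerm_Cross_QueryUnmatchedTerm:{m}&{u}"
                 for m in matched for u in unmatched]
    features.append(
        f"QueryLength_Cross_QueryMatchedLength:{len(query_terms)}&{len(matched)}")
    return features

def original_score(features, weights):
    """Original score of a matching pattern: the sum of its feature weights."""
    return sum(weights.get(f, 0.0) for f in features)

feats = generate_features(["home", "depot", "wood", "floor"], [1, 0, 0, 1])
print(len(feats))  # 2 matched + 2 unmatched + 4 cross + 1 length feature
```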
Herein LR model and DNN model will be discussed below as examples of implementing the machine learning scoring function through using the above features.
Logarithmic loss (LogLoss) is the loss function for these models, as indicated by Equation (1) .
LogLoss (y, p) = - (y log (p) + (1-y) log (1-p) )     Equation (1)
where “y” stands for the label and p (·) is the predicted probability of the model.
An LR model for scoring is represented as the following score (d, Q) by Equation (2) :
score (d, Q) = p (y=1|x) = 1/ (1 + exp (- (w^T x + b) ) )     Equation (2)
where “y = 1” stands for “the document is relevant to the query” and p (·) returns a probability value, d represents the document, Q represents the query, w is a weight vector of the features for the query, x is a feature input vector, T represents transposition, and b represents a bias.
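Equations (1) and (2) can be sketched numerically as follows; the feature vector, the weights and the bias are illustrative values, not trained parameters.

```python
import math

def log_loss(y, p):
    """Equation (1): LogLoss(y, p) = -(y*log(p) + (1-y)*log(1-p))."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def lr_score(x, w, b):
    """LR scoring function: the predicted probability that the document is
    relevant to the query, p(y=1|x) = sigmoid(w^T x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative sparse feature vector (1 = feature present) and weights.
x = [1, 0, 1, 1]
w = [0.8, -0.4, 0.2, 0.5]
score = lr_score(x, w, b=-0.3)
print(round(score, 4))
```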
An exemplary DNN model for calculating a relevance score of a matched document at 210 will be described below in connection with FIG. 4.
In some examples, after training, some features with low weights may be deleted from the feature list for the consideration of model capacity.
When the original score corresponding to the exemplary matching pattern [1001] is calculated, the relevance score of the matched document A may be determined based on the original score, for example, being the same as the original score, a multiple of the original score, and so on. The relevance score of each matched document may be calculated based on the respective original score corresponding to its matching pattern.
Optionally, at 212, a plurality of matched documents with respective relevance scores, comprising the document A, may be returned or provided to a user in a sequence, as a result of document retrieval. For example, the returned or provided plurality of matched documents may be the top-k matched documents. The returned or provided top-k documents may be ranked based on the calculated relevance scores of the matched documents.
It should be appreciated that the method 200 may further comprise any other or alternative steps for document retrieval according to the embodiments of the present disclosure as mentioned above.
FIG. 3 illustrates an exemplary inverted index 300 showing an exemplary matching pattern 310 for Document 3 according to an embodiment. The inverted index may show each term in a query and a plurality of documents matched with any term in the query.
A simple inverted index is shown in FIG. 3. Each row represents one term in the query, and each column represents a document indexed by a DocID. In FIG. 3, the exemplary query comprises four terms, term 1 (home) , term 2 (depot) , term 3 (wood) , term 4 (floor) , and for each term in the query, there are several matched documents indicated by DocIDs in the inverted index. For example, for term 1, there are matched documents indicated by Doc3, Doc7, Doc9, Doc13 in the inverted index; for term 2, there are matched documents indicated by Doc1, Doc7, Doc10, Doc11, and so on. From the inverted index in FIG. 3, a matching pattern between the query and a document may be obtained. For example, a matching pattern between the query and Doc1 is [0101] , a matching pattern between the query and Doc3 is [1001] which is exemplarily highlighted by a dashed box 310 in FIG. 3, a matching pattern between the query and Doc7 is [1110] , and so on.
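The derivation of matching patterns from the inverted index can be sketched as follows. The postings for term 3 (wood) and term 4 (floor) are not given in the text and are assumed here, chosen so that the patterns for Doc1, Doc3 and Doc7 agree with the ones stated above.

```python
def matching_pattern(query_terms, inverted_index, doc_id):
    """Derive the matching pattern between a query and a document from
    an inverted index mapping each term to its set of matched DocIDs."""
    return [1 if doc_id in inverted_index.get(term, set()) else 0
            for term in query_terms]

# Toy index mirroring FIG. 3; the 'wood' and 'floor' postings are
# assumed, picked to reproduce the patterns stated in the text.
index = {
    "home":  {3, 7, 9, 13},
    "depot": {1, 7, 10, 11},
    "wood":  {2, 7, 9, 13},
    "floor": {1, 3, 8, 13},
}
query = ["home", "depot", "wood", "floor"]
```

With this index, Doc1 yields [0101], Doc3 yields [1001], and Doc7 yields [1110], matching the description of FIG. 3.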
It should be understood that, the structure of the inverted index 300 shown in FIG. 3 is just exemplary; any other structure of the inverted index for document retrieval may be possible for the present disclosure.
FIG. 4 illustrates an exemplary Deep Neural Network (DNN) scoring model 400 for document retrieval according to an embodiment.
As shown in FIG. 4, features of different feature types are inputted into the DNN model 400 in a form of sparse input, such as features of feature type i, features of feature type j, etc. In an embedding layer, instead of one weight per feature, the DNN model 400 applies a weight vector per feature. For example, the weight vector for the a-th feature of feature type i is a vector with k values. In FIG. 4, a and b are the numbers of features of feature type i and feature type j respectively. Each vector in the embedding layer is passed through a pooling layer to obtain a weight vector for all the features of each feature type. After non-linear transformation of the weight vectors through two hidden layers, e.g., Hidden Layer1 and Hidden Layer2, a score of a document is generated and outputted through an output layer.
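A minimal sketch of this forward pass (embedding lookup, pooling per feature type, two non-linear hidden layers, and a linear output) is given below. The choice of sum pooling, the tanh nonlinearity, and all weights are illustrative assumptions, since FIG. 4 does not fix them.

```python
import math

def dnn_score(feature_ids_by_type, embeddings, hidden1, hidden2, out_w):
    """Sketch of the FIG. 4 model: per-feature embedding vectors,
    sum pooling per feature type, two tanh hidden layers, and a
    single output score. All weights here are illustrative."""
    pooled = []
    for ftype, ids in feature_ids_by_type.items():
        vecs = [embeddings[ftype][i] for i in ids]
        # Sum pooling: one k-dimensional vector per feature type
        pooled += [sum(col) for col in zip(*vecs)]
    h = pooled
    for layer in (hidden1, hidden2):  # each layer: list of (row, bias)
        h = [math.tanh(sum(wi * xi for wi, xi in zip(row, h)) + b)
             for row, b in layer]
    return sum(wi * hi for wi, hi in zip(out_w, h))

embeddings = {"i": [[1.0, 0.0], [0.0, 1.0]]}  # k = 2 values per feature
identity = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
score = dnn_score({"i": [0, 1]}, embeddings, identity, identity, [1.0, 1.0])
```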
FIG. 5 illustrates another exemplary process 500 for document retrieval according to an embodiment. The exemplary process 500 optimizes the process 200 by considering a plurality of potential matching patterns for a given query and pre-calculating a plurality of scores for those potential matching patterns. This pre-calculation may be performed before, at the same time as, or after the operations of searching in a database to obtain matched documents, classifying each term in the query and/or obtaining matching patterns between the query and the matched documents. In some examples, the plurality of potential matching patterns for a given query may comprise all possible or potential matching patterns for the given query. In some other examples, the plurality of potential matching patterns for a given query may comprise a number of potential matching patterns less than all potential matching patterns.
At 502, a query Q may be received, for example, from a user. The query may be a search query and comprise one or more terms. As an example, a query “home depot wood floor” may be received herein. The exemplary terms contained in the query may be “home” , “depot” , “wood” and “floor” .
At 504, a plurality of potential matching patterns of the query Q may be determined. For example, if there are n terms in the query Q, each term can potentially be matched or unmatched with a document indicated by a DocID in the inverted index. Therefore, up to 2^n potential matching patterns of the query Q, which may be all possible or potential matching patterns, may be determined. For example, given a query including three terms, this query may have 2^3 = 8 potential matching patterns, including [000], [001], [010], [100], [011], [101], [110], [111]. Since an all-zero matching pattern such as [000] indicates that there is no document matched with the query, such a matching pattern may not be considered when scoring documents for document retrieval. In this case, the number of potential matching patterns whose scores are to be calculated may be less than all potential matching patterns of a given query, for example, 2^n - 1 potential matching patterns excluding the all-zero pattern.
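The enumeration of potential matching patterns described above, excluding the all-zero pattern, can be sketched as:

```python
from itertools import product

def potential_patterns(n_terms, include_all_zero=False):
    """Enumerate the up-to-2^n potential matching patterns of an n-term
    query; the all-zero pattern is excluded by default, since it
    matches no document and need not be scored."""
    patterns = [list(bits) for bits in product([0, 1], repeat=n_terms)]
    if not include_all_zero:
        patterns = [p for p in patterns if any(p)]
    return patterns
```

For a three-term query this yields the 2^3 - 1 = 7 scoreable patterns.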
At 506, for each of the plurality of potential matching patterns determined at 504, an original score or an upperbound score may be obtained.
An original score corresponding to each potential matching pattern may be obtained by performing a calculation based on weights of a plurality of features associated with the query, similar to the original score calculation described above. Therefore, the detailed description of the original score calculation for each potential matching pattern is omitted herein for simplicity.
Herein, the upperbound score corresponding to a matching pattern may be the highest score among the scores of the sub-patterns of the matching pattern, such as a max score among multiple scores. For example, the sub-patterns of a particular matching pattern may comprise the set of all matching patterns whose matched terms form a non-empty subset of the matched terms of the particular matching pattern. For example, if a matching pattern is [101], then its sub-patterns may be [100], [001], [101]; if a matching pattern is [010], then its sub-pattern may be [010]; if a matching pattern is [111], then its sub-patterns may be [001], [010], [011], [100], [101], [110], [111]. The upperbound score for a matching pattern may be utilized to skip less related documents among a plurality of matched documents, to improve retrieval performance. The obtaining of the upperbound score corresponding to a matching pattern will be described below.
Upperbound score obtaining
Given a query Q = [t1, t2, ..., tm] with m distinct terms, the original score for each potential matching pattern may be calculated, and an upperbound score for a particular potential matching pattern may be derived or obtained from the original scores of the sub-patterns of that particular potential matching pattern.
If a query has 3 terms, {a, b, c}, then an upperbound score for each potential matching pattern may be obtained as follows:
upperbound ( {a, b, c} )
= maxScore ( {1/0, 1/0, 1/0} )
= max (maxScore ( {1/0, 1/0, 0} ) , maxScore ( {0, 1/0, 1/0 } ) , maxScore ( {1/0, 0, 1/0} ) , score ( {1, 1, 1} ) )
= max (upperbound ( {a, b} ) , upperbound ( {b, c} ) , upperbound ( {a, c} ) , score ( {1, 1, 1} ) )
Table 2 lists a plurality of potential matching patterns for an exemplary query including 3 terms and calculated original scores and upperbound scores for each potential matching pattern.
| Matching Pattern | Original Score | Upperbound Score |
| 001 | 2 | 2 |
| 010 | 3 | 3 |
| 011 | 7 | 7 |
| 100 | 6 | 6 |
| 101 | 5 | 6 |
| 110 | 8 | 8 |
| 111 | 9 | 9 |
Table 2
It should be understood that the specific original scores and upperbound scores for matching patterns shown in Table 2 are just exemplary; any other possible scores may be calculated based on different weights of different features for different queries.
Each upperbound score of the listed potential matching patterns may be obtained as follows:
for potential matching pattern [001], upperbound([001]) = max{original score([001])} = 2;
for potential matching pattern [010], upperbound([010]) = max{original score([010])} = 3;
for potential matching pattern [011], upperbound([011]) = max{original score([001]), original score([010]), original score([011])} = 7;
for potential matching pattern [100], upperbound([100]) = max{original score([100])} = 6;
for potential matching pattern [101], upperbound([101]) = max{original score([001]), original score([100]), original score([101])} = 6;
for potential matching pattern [110], upperbound([110]) = max{original score([100]), original score([010]), original score([110])} = 8;
for potential matching pattern [111], upperbound([111]) = max{original score([001]), original score([010]), original score([100]), original score([011]), original score([101]), original score([110]), original score([111])} = 9.
From the above obtained upperbound scores and Table 2, although the original score "5" of matching pattern [101] (terms 1 and 3 are matched) is lower than the original score "6" of its sub-pattern [100] (only term 1 is matched), the upperbound score of pattern [101] is equal to the highest original score "6" among the scores of its sub-patterns.
Since documents without any matched term (that is, documents whose matching patterns with the query are "0...00") may not be retrieved, or are less useful in document retrieval, there is no need to calculate their scores.
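The upperbound computation over sub-patterns can be sketched as follows, reproducing the Table 2 values. The dictionary keys are pattern tuples; the `any(sub)` check excludes the all-zero pattern, which has no score.

```python
def upperbound(pattern, original_scores):
    """Upperbound score of a matching pattern: the max original score
    over all of its sub-patterns, i.e. patterns whose matched positions
    form a non-empty subset of the pattern's matched positions
    (including the pattern itself)."""
    return max(score for sub, score in original_scores.items()
               if any(sub) and all(s <= p for s, p in zip(sub, pattern)))

# Original scores from Table 2, keyed by matching-pattern tuples
table2 = {
    (0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 7, (1, 0, 0): 6,
    (1, 0, 1): 5, (1, 1, 0): 8, (1, 1, 1): 9,
}
```

For pattern [101] this returns 6, the score of sub-pattern [100], even though the original score of [101] itself is only 5.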
The obtained original score or upperbound score for each potential matching pattern of the query may be fed to operation 514 to determine relevance scores of matched documents.
At 508, a database may be searched to obtain matched documents, which is similar to operation 204 in FIG. 2. Therefore, the detailed description of the operation 508 may be omitted herein for simplicity.
At 510, each term in the query may be classified in accordance with whether the term is contained in the document, which is similar to operation 206 in FIG. 2. Therefore, the detailed description of the operation 510 may be omitted herein for simplicity.
At 512, respective matching pattern between each matched document and the query may be obtained, which is similar to operation 208 in FIG. 2. Therefore, the detailed description of the operation 512 may be omitted herein for simplicity.
At 514, respective relevance score of each matched document may be determined with respect to the query, based on the original score or upperbound score corresponding to each potential matching pattern obtained at 506 and the matching pattern between the document and the query obtained at 512. In some examples, a score corresponding to a potential matching pattern identical to the matching pattern of the matched document obtained at 512 may be selected from the plurality of original scores or upperbound scores obtained at 506, and the relevance score of the matched document may be determined according to the selected score.
Optionally, at 516, a plurality of matched documents with respective relevance score, such as top-k documents, may be returned or provided to a user in a sequence, as a result of document retrieval, which is similar to step 212 in FIG. 2. Therefore, the detailed description of the step 516 may be omitted herein for simplicity.
It should be understood that, although the operations 504, 506 are shown being performed in parallel with the operations 508, 510, 512 in FIG. 5, they may be performed before, at the same time as, or after the operations 508, 510 and 512.
FIG. 6 (A) illustrates an exemplary inverted index 610 before cursors moving according to an embodiment, and FIG. 6 (B) illustrates another exemplary inverted index 620 after cursors moving by using upperbound score according to an embodiment.
The usage of upperbound score may bring retrieval optimization in the inverted index. For the m distinct terms in a query Q, there may be m corresponding posting lists in the inverted index, including the empty ones.
Taking a query including three terms as an example, there may be three posting lists with cursors, as shown in FIG. 6 (A). In a retrieval process based on the upperbound score, a threshold may be used to determine a "pivot" document. Herein, the pivot document may represent a document which has a score above the threshold. In this example, a threshold θ for determining the "pivot" document may be assumed as θ = 8. A pivot document among a plurality of matched documents associated with the query may be determined by ranking the posting lists based on the sequence of DocIDs to which the cursors currently point. For example, in FIG. 6 (A), cursor 1 for term 1 in the query currently points to <4>, cursor 2 for term 2 currently points to <3>, and cursor 3 for term 3 currently points to <7>. Based on the respective terms and the corresponding DocIDs to which the cursors point, the ranked posting lists may be as shown in Table 3.
| Posting list (p) | 1 | 2 | 3 |
| TermID | Term2 | Term1 | Term3 |
| DocID | <3> | <4> | <7> |
Table 3
According to the order in Table 3 in combination with the upperbound score in Table 2, the pivot document may be determined below step by step:
· For p = 1, Upperbound ( [0, 1, 0] ) =3, which does not satisfy score>θ (θ=8) , so the retrieval process goes to the next posting list p = 2;
· For p = 2, Upperbound ( [1, 1, 0] ) =8, which does not satisfy score>θ (θ=8) , so the retrieval process goes to the next posting list p = 3;
· For p = 3, Upperbound ( [1, 1, 1] ) =9, which is above θ (θ=8) , so the retrieval process returns DocID “ <7> ” to indicate the pivot document. Herein DocID “ <7> ” may be considered as a pivot DocID.
When the pivot DocID is determined, which is <7> for Term 3 as shown in FIG. 6 (A), the cursors of Term1 and Term2 (cursor1 and cursor2) may each be moved to the first DocID that is greater than or equal to <7>, so that the score of the pivot document can be fully evaluated.
FIG. 6 (B) shows the status after the cursors have moved, wherein cursor1 points to <10> and cursor2 points to <7>. Thus, the matching pattern for document <7> is "011" and the score of document <7> based on the matching pattern "011" is 7, which is less than the threshold 8. So document <7> may be skipped, and cursor2 and cursor3 may be moved to their respective next DocIDs. The posting lists may be ranked based on the current DocIDs again to find the next pivot DocID, until all the posting lists have moved to the end or the next pivot cannot be found because each of the remaining upperbound scores is less than or equal to the threshold θ.
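The pivot-selection step walked through above can be sketched as follows. The function name `find_pivot` and the cursor representation (a dict from term index to current DocID) are illustrative assumptions; the upperbound scores are taken from Table 2.

```python
def find_pivot(cursors, upperbound_of, theta):
    """Rank posting-list cursors by current DocID, then accumulate a
    matching pattern over the first p terms until its upperbound score
    exceeds the threshold theta; return that cursor's DocID as the
    pivot, or None when no pivot exists."""
    active = sorted((doc, term) for term, doc in cursors.items()
                    if doc is not None)  # None marks an exhausted list
    pattern = [0] * len(cursors)
    for doc, term in active:
        pattern[term] = 1
        if upperbound_of(tuple(pattern)) > theta:
            return doc  # pivot DocID
    return None  # remaining upperbounds are all <= theta

# Upperbound scores from Table 2, keyed by matching-pattern tuples
ub = {(0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 7, (1, 0, 0): 6,
      (1, 0, 1): 6, (1, 1, 0): 8, (1, 1, 1): 9}
# FIG. 6 (A): cursor of term 1 -> <4>, term 2 -> <3>, term 3 -> <7>
pivot = find_pivot({0: 4, 1: 3, 2: 7}, lambda p: ub[p], theta=8)
```

With θ = 8 this reproduces the walkthrough: [010] gives 3, [110] gives 8 (not above θ), and [111] gives 9, so DocID <7> is returned as the pivot.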
The upperbound score may be calculated recursively with memoization, for example, by the following exemplary procedure CalculateUpperBound:
Require: postingListSet: a pattern with each element initialized as "true" or "false";
PatternLen: returns the length of postingListSet;
if patternIndice in upperBoundCache then
  return upperBoundCache[patternIndice]
end if
upperBound = scoreCache[patternIndice];
for i = 0; i < PatternLen(postingListSet); ++i do
  if postingListSet[i] then
    postingListSet.set(i, false);
    subUpperBound = CalculateUpperBound(postingListSet);
    if subUpperBound > upperBound then
      upperBound = subUpperBound;
    end if
    postingListSet.set(i, true);
  end if
end for
upperBoundCache[patternIndice] = upperBound;
return upperBound
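A Python translation of the listing above might look like the following; the memoization via `upperbound_cache` mirrors the pseudocode's upperBoundCache, and the Table 2 original scores serve as the score cache.

```python
def calculate_upperbound(pattern, score_cache, upperbound_cache):
    """Memoized recursion: the upperbound of a pattern is the max of
    its own original score and the upperbounds of the patterns obtained
    by clearing one matched bit at a time."""
    key = tuple(pattern)
    if key in upperbound_cache:
        return upperbound_cache[key]
    upper = score_cache[key]
    for i, bit in enumerate(pattern):
        if bit:
            sub = list(pattern)
            sub[i] = 0
            if any(sub):  # the all-zero pattern has no score
                upper = max(upper, calculate_upperbound(
                    sub, score_cache, upperbound_cache))
    upperbound_cache[key] = upper
    return upper

# Original scores from Table 2, keyed by matching-pattern tuples
table2 = {
    (0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 7, (1, 0, 0): 6,
    (1, 0, 1): 5, (1, 1, 0): 8, (1, 1, 1): 9,
}
```

For pattern [101] the recursion visits [001] and [100] and returns 6, matching the Table 2 upperbound.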
FIG. 7 illustrates a flowchart of an exemplary method 700 for scoring documents in document retrieval according to an embodiment.
At 710, a query is received, in which the query comprises one or more terms.
At 720, a database may be searched to obtain a document containing at least one term of the one or more terms.
At 730, each term of the one or more terms is classified in accordance with whether the term is contained in the document. The each classified term may be a matched term or an unmatched term.
At 740, a matching pattern between the query and the document is obtained. The matching pattern may indicate an integration of the each classified term.
At 750, a relevance score of the document with respect to the query is determined based at least on the matching pattern.
In an implementation, the integration identifies each term in the query, classification of the each term and position of the each term in the query.
In an implementation, the method 700 may further comprise: determining a plurality of potential matching patterns of the query; and obtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
In an implementation, the relevance score is determined by selecting, from the plurality of scores, a score corresponding to a potential matching pattern identical to the matching pattern.
In an implementation, calculating the score corresponding to each potential matching pattern comprises: generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and calculating the score corresponding to the potential matching pattern based on weights of the plurality of features.
In an implementation, calculating the score corresponding to each potential matching pattern comprises: calculating at least one original score corresponding to at least one sub-pattern of the potential matching pattern; and obtaining the score corresponding to the potential matching pattern by selecting a highest original score among the at least one original score.
In an implementation, the relevance score is determined by calculating a score corresponding to the matching pattern.
In an implementation, calculating the score corresponding to the matching pattern comprises: generating a plurality of features of different types for the matching pattern based on the matching pattern and the query, wherein each feature has a weight; and calculating the score corresponding to the matching pattern based on weights of the plurality of features.
In an implementation, the different types comprise at least one of: matched terms in the query, unmatched terms in the query, combination of any one matched term and any one unmatched term in the query, combination of any one matched term and any other matched term in the query, combination of the number of terms and the number of matched terms in the query, continuous matched terms in the query, chunk of matched terms in the query, bigram of any one matched term and any one unmatched term in the query, bigram of unmatched terms in the query, combination of any one matched term in the query and a corresponding matching pattern, and combination of any one unmatched term in the query and a corresponding matching pattern.
In an implementation, the weight of each feature is trained based on a set of training data. Each training data may comprise a historical query and a plurality of retrieved historical documents, and each historical document may be labeled with a relevance value.
In an implementation, the matching pattern is represented by a vector, and each dimension of the vector has a value of zero or one.
It should be appreciated that the method 700 may further comprise any steps/operations for scoring documents in document retrieval according to the embodiments of the present disclosure as mentioned above.
FIG. 8 illustrates an exemplary apparatus 800 for scoring documents in document retrieval according to an embodiment.
The apparatus 800 may comprise: a query receiving module 810, for receiving a query, the query comprising one or more terms; a searching module 820, for searching in a database to obtain a document containing at least one term of the one or more terms; a classifying module 830, for classifying each term of the one or more terms in accordance with whether the term is contained in the document, wherein the each classified term is a matched term or an unmatched term; a matching pattern obtaining module 840, for obtaining a matching pattern between the query and the document, the matching pattern indicating an integration of the each classified term; and a score determining module 850, for determining a relevance score of the document with respect to the query based at least on the matching pattern.
In an implementation, the integration identifies each term in the query, classification of the each term and position information of the each term in the query.
In an implementation, the matching pattern obtaining module 840 is further for determining a plurality of potential matching patterns of the query, and the score determining module 850 is further for obtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
In an implementation, the relevance score is determined by selecting, from the plurality of scores, a score corresponding to a potential matching pattern identical to the matching pattern.
In an implementation, the score determining module 850 is further for: generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and calculating the score corresponding to the potential matching pattern based on weights of the plurality of features.
In an implementation, the different types comprise at least one of: matched terms in the query, unmatched terms in the query, combination of any one matched term and any one unmatched term in the query, combination of any one matched term and any other matched term in the query, combination of the number of terms and the number of matched terms in the query, continuous matched terms in the query, chunk of matched terms in the query, bigram of any one matched term and any one unmatched term in the query, bigram of unmatched terms in the query, combination of any one matched term in the query and a corresponding matching pattern, and combination of any one unmatched term in the query and a corresponding matching pattern.
In an implementation, the score determining module 850 is further for: calculating at least one original score corresponding to at least one sub-pattern of the potential matching pattern; and obtaining the score corresponding to the potential matching pattern by selecting a highest original score among the at least one original score.
In an implementation, the relevance score is determined by calculating a score corresponding to the matching pattern.
Moreover, the apparatus 800 may also comprise any other modules configured for scoring documents in document retrieval according to the embodiments of the present disclosure as mentioned above.
FIG. 9 illustrates another exemplary apparatus 900 for scoring documents in document retrieval according to an embodiment.
The apparatus 900 may comprise at least one processor 910. The apparatus 900 may further comprise a memory 920 that is connected with the processor 910. The memory 920 may store computer-executable instructions that, when executed, cause the processor 910 to perform any operations of the methods for scoring documents in document retrieval according to the embodiments of the present disclosure as mentioned above.
The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for scoring documents in document retrieval according to the embodiments of the present disclosure as mentioned above.
It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip) , an optical disk, a smart card, a flash memory device, random access memory (RAM) , read only memory (ROM) , programmable ROM (PROM) , erasable PROM (EPROM) , electrically erasable PROM (EEPROM) , a register, or a removable disk. Although memory is shown separate from the processors in the various aspects presented throughout the present disclosure, the memory may be internal to the processors (e.g., cache or register) .
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.
Claims (20)
- A method for scoring documents in document retrieval, comprising:receiving a query, the query comprising one or more terms;searching in a database to obtain a document containing at least one term of the one or more terms;classifying each term of the one or more terms in accordance with whether the term is contained in the document, wherein the each classified term is a matched term or an unmatched term;obtaining a matching pattern between the query and the document, the matching pattern indicating an integration of the each classified term; anddetermining a relevance score of the document with respect to the query based at least on the matching pattern.
- The method of claim 1, wherein the integration identifies each term in the query, classification of the each term and position of the each term in the query.
- The method of claim 1, further comprising:determining a plurality of potential matching patterns of the query; andobtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
- The method of claim 3, wherein the relevance score is determined by selecting, from the plurality of scores, a score corresponding to a potential matching pattern identical to the matching pattern.
- The method of claim 3, wherein calculating the score corresponding to each potential matching pattern comprises:generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; andcalculating the score corresponding to the potential matching pattern based on weights of the plurality of features.
- The method of claim 3, wherein calculating the score corresponding to each potential matching pattern comprises:calculating at least one original score corresponding to at least one sub-pattern of the potential matching pattern; andobtaining the score corresponding to the potential matching pattern by selecting a highest original score among the at least one original score.
- The method of claim 1, wherein the relevance score is determined by calculating a score corresponding to the matching pattern.
- The method of claim 7, wherein calculating the score corresponding to the matching pattern comprises:generating a plurality of features of different types for the matching pattern based on the matching pattern and the query, wherein each feature has a weight; andcalculating the score corresponding to the matching pattern based on weights of the plurality of features.
- The method of claim 5 or 8, wherein the different types comprise at least one of:matched terms in the query,unmatched terms in the query,combination of any one matched term and any one unmatched term in the query,combination of any one matched term and any other matched term in the query,combination of the number of terms and the number of matched terms in the query,continuous matched terms in the query,chunk of matched terms in the query,bigram of any one matched term and any one unmatched term in the query,bigram of unmatched terms in the query,combination of any one matched term in the query and a corresponding matching pattern, andcombination of any one unmatched term in the query and a corresponding matching pattern.
- The method of claim 5 or 8, wherein the weight of each feature is trained based on a set of training data, each training data comprising a historical query and a plurality of retrieved historical documents, each historical document being labeled with a relevance value.
- The method of claim 1, wherein the matching pattern is represented by a vector, and each dimension of the vector has a value of zero or one.
- An apparatus for scoring documents in document retrieval, comprising:a query receiving module, for receiving a query, the query comprising one or more terms;a searching module, for searching in a database to obtain a document containing at least one term of the one or more terms;a classifying module, for classifying each term of the one or more terms in accordance with whether the term is contained in the document, wherein the each classified term is a matched term or an unmatched term;a matching pattern obtaining module, for obtaining a matching pattern between the query and the document, the matching pattern indicating an integration of the each classified term; anda score determining module, for determining a relevance score of the document with respect to the query based at least on the matching pattern.
- The apparatus of claim 12, wherein the integration identifies each term in the query, classification of the each term and position of the each term in the query.
- The apparatus of claim 12, wherein the matching pattern obtaining module is further for determining a plurality of potential matching patterns of the query, and the score determining module is further for obtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
- The apparatus of claim 14, wherein the relevance score is determined by selecting, from the plurality of scores, a score corresponding to a potential matching pattern identical to the matching pattern.
- The apparatus of claim 14, wherein the score determining module is further for: generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and calculating the score corresponding to the potential matching pattern based on weights of the plurality of features.
- The apparatus of claim 16, wherein the different types comprise at least one of: matched terms in the query, unmatched terms in the query, a combination of any one matched term and any one unmatched term in the query, a combination of any one matched term and any other matched term in the query, a combination of the number of terms and the number of matched terms in the query, continuous matched terms in the query, a chunk of matched terms in the query, a bigram of any one matched term and any one unmatched term in the query, a bigram of unmatched terms in the query, a combination of any one matched term in the query and a corresponding matching pattern, and a combination of any one unmatched term in the query and a corresponding matching pattern.
- The apparatus of claim 14, wherein the score determining module is further for: calculating at least one original score corresponding to at least one sub-pattern of the potential matching pattern; and obtaining the score corresponding to the potential matching pattern by selecting a highest original score among the at least one original score.
- The apparatus of claim 12, wherein the relevance score is determined by calculating a score corresponding to the matching pattern.
- An apparatus for scoring documents in document retrieval, comprising: one or more processors; and a memory storing computer-executable instructions that, when executed, cause the one or more processors to: receive a query, the query comprising one or more terms; search in a database to obtain a document containing at least one term of the one or more terms; classify each term of the one or more terms in accordance with whether the term is contained in the document, wherein each classified term is a matched term or an unmatched term; obtain a matching pattern between the query and the document, the matching pattern indicating an integration of each classified term; and determine a relevance score of the document with respect to the query based at least on the matching pattern.
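Read together, the claims describe a pipeline: classify each query term as matched or unmatched against a retrieved document, encode the classifications as a binary matching-pattern vector (one dimension per term, preserving term position), generate features of several types from the pattern, and sum learned feature weights into a relevance score. The following Python sketch is purely illustrative; all function names, the example query, and the weights are assumptions, not the patented implementation.

```python
# Illustrative sketch of the matching-pattern scoring described in the
# claims above. The feature types shown are a subset of those listed in
# claim 17; weights would be trained on labeled (query, document) data.

def classify_terms(query_terms, document_text):
    """Mark each query term as matched (1) or unmatched (0),
    depending on whether it appears in the document."""
    doc_terms = set(document_text.lower().split())
    return [1 if t.lower() in doc_terms else 0 for t in query_terms]

def matching_pattern(query_terms, document_text):
    """The matching pattern as a binary vector: one dimension per
    query term, in query order (cf. claim 11)."""
    return tuple(classify_terms(query_terms, document_text))

def pattern_features(query_terms, pattern):
    """Generate a few of the claimed feature types: individual
    matched/unmatched terms, matched-term pairs, and the combination
    of query length with match count."""
    matched = [t for t, m in zip(query_terms, pattern) if m]
    unmatched = [t for t, m in zip(query_terms, pattern) if not m]
    features = [("matched", t) for t in matched]
    features += [("unmatched", t) for t in unmatched]
    # combination of any one matched term and any other matched term
    for i, a in enumerate(matched):
        for b in matched[i + 1:]:
            features.append(("matched_pair", a, b))
    features.append(("len_vs_matched", len(query_terms), len(matched)))
    return features

def score(query_terms, pattern, weights):
    """Relevance score: sum of learned weights over the pattern's
    features; unknown features contribute zero."""
    return sum(weights.get(f, 0.0) for f in pattern_features(query_terms, pattern))

query = ["cheap", "flights", "london"]
doc = "Find flights to London at the best price"
p = matching_pattern(query, doc)
# "cheap" does not occur in the document; "flights" and "london" do.
print(p)  # (0, 1, 1)
```

Because the score depends only on the query and the matching pattern, not on the document text itself, scores for all potential matching patterns of a query can be precomputed once (claims 3 and 14) and looked up per retrieved document.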
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201980022597.0A CN111919208B (en) | 2019-01-25 | 2019-01-25 | Scoring documents in document retrieval |
| PCT/CN2019/073262 WO2020151015A1 (en) | 2019-01-25 | 2019-01-25 | Scoring documents in document retrieval |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2019/073262 WO2020151015A1 (en) | 2019-01-25 | 2019-01-25 | Scoring documents in document retrieval |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020151015A1 true WO2020151015A1 (en) | 2020-07-30 |
Family
ID=71735418
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2019/073262 Ceased WO2020151015A1 (en) | 2019-01-25 | 2019-01-25 | Scoring documents in document retrieval |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN111919208B (en) |
| WO (1) | WO2020151015A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114020878A (en) * | 2021-11-29 | 2022-02-08 | 清华大学 | Feature text matching method and device, electronic equipment and storage medium |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105814564A (en) * | 2013-12-14 | 2016-07-27 | 微软技术许可有限责任公司 | Query techniques and ranking results for knowledge-based matching |
| CN106919565A (en) * | 2015-12-24 | 2017-07-04 | 航天信息股份有限公司 | A kind of document retrieval method and system based on MapReduce |
| US20170235736A1 (en) * | 2008-10-29 | 2017-08-17 | Ashwin Swaminathan | System and method for confidentiality-preserving rank-ordered search |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103123653A (en) * | 2013-03-15 | 2013-05-29 | 山东浪潮齐鲁软件产业股份有限公司 | Search engine retrieving ordering method based on Bayesian classification learning |
| CN106446122B (en) * | 2016-09-19 | 2020-03-10 | 华为技术有限公司 | Information retrieval method, device and computing device |
Application Events
- 2019-01-25: CN national-phase application CN201980022597.0A filed; granted as CN111919208B (status: Active)
- 2019-01-25: PCT application PCT/CN2019/073262 filed; published as WO2020151015A1 (status: Ceased)
Non-Patent Citations (1)
| Title |
|---|
| MANNING, CHRISTOPHER D. ET AL.: "An Introduction to Information Retrieval" [online], Online Edition, Cambridge University Press, 1 April 2009, XP055389569 [retrieved on 2019-10-12] |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111919208B (en) | 2024-06-21 |
| CN111919208A (en) | 2020-11-10 |
Similar Documents
| Publication | Title |
|---|---|
| US11663254B2 (en) | System and engine for seeded clustering of news events |
| EP2823410B1 (en) | Entity augmentation service from latent relational data |
| US10068008B2 (en) | Spelling correction of email queries |
| CN110569496B (en) | Entity linking method, device and storage medium |
| CN107357793B (en) | Information recommendation method and device |
| US20130110839A1 (en) | Constructing an analysis of a document |
| US20130060769A1 (en) | System and method for identifying social media interactions |
| US20110264651A1 (en) | Large scale entity-specific resource classification |
| CN114330335B (en) | Keyword extraction method, device, equipment and storage medium |
| KR102069341B1 (en) | Method for searching electronic document and apparatus thereof |
| US20100106719A1 (en) | Context-sensitive search |
| CN112100396A (en) | Data processing method and device |
| CN109033101A (en) | Label recommendation method and device |
| Zhou et al. | Real-time context-aware social media recommendation |
| CA2956627C (en) | System and engine for seeded clustering of news events |
| Kılınç | An accurate toponym-matching measure based on approximate string matching |
| CN109977292A (en) | Search method, apparatus, computing device and computer-readable storage medium |
| WO2024035518A1 (en) | Relevance prediction based on heterogeneous graph learning |
| Sanyal et al. | Enhancing access to scholarly publications with surrogate resources |
| CN113486232B (en) | Query method, device, server, medium and product |
| US20120059786A1 (en) | Method and an apparatus for matching data network resources |
| Iram et al. | Anatomy of sentiment analysis of tweets using machine learning approach |
| CN115630144A (en) | Document searching method and device and related equipment |
| CN110008407A (en) | Information retrieval method and device |
| WO2020151015A1 (en) | Scoring documents in document retrieval |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19911304; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19911304; Country of ref document: EP; Kind code of ref document: A1 |