WO2016069767A1

WO2016069767A1 - Classifier recall estimation for sparse topics

Info

Publication number: WO2016069767A1
Application number: PCT/US2015/057851
Authority: WO
Inventors: Praveen BOMMANNAVAR; Alek KOLCZ; Anand Rajaraman
Original assignee: Twitter, Inc.
Priority date: 2014-10-28
Filing date: 2015-10-28
Publication date: 2016-05-06
Also published as: CA2966350A1; KR102262953B1; KR20200139843A; KR20190062613A; BR112017009015A2; KR101984773B1; KR20170139492A; KR102188337B1; EP3213235A1; EP3213235A4; DE202015009963U1

Abstract

A method for document retrieval includes classifying, by a first classifier, a documents to obtain a first set of resulting documents, and classifying, by a second classifier, the documents to obtain a second set of resulting documents. The method further includes estimating a recall metric of the first classifier based on the first set of resulting documents and the second set of resulting documents, revising the first classifier based on the recall metric, and classifying, by the revised classifier, the documents to obtain at least one additional document excluded from the first set of documents. The method further includes sending a communication based on the classification.

Description

CLASSIFIER RECALL ESTIMATION

FOR

SPARSE TOPICS

[0001] This application claims priority, pursuant to 35 U.S.C. § 119(e), to U.S.

Provisional Application No. 62/069,618, filed on October 28, 2014, and entitled, "METHOD AND SYSTEM FOR RECALL ESTIMATION," which is incorporated by reference herein in its entirety.

BACKGROUND

[0002] The instant specification relates to digital information retrieval in the context of social platforms.

[0003] In the field of information retrieval, search engines typically implement indexing and retrieval, responsive to search queries, of networked accessible documents. As used in the instant specification, documents are units of content and can be, for example, Web pages, text files, Portable Document Format files, video files, audio files, image files, and social messages of a social platform. As can be expected, different types of documents will exhibit different features, which in general are programmatically detectable properties of a document that are computationally quantifiable. A particular text file with its particular string of text (often parsable into meaningful terms), for example, will exhibit features commonly referred to as term frequency, document length, and inverse term frequency. A particular photo with its particular captured image will exhibit features relating to color, opacity, and texture. Typically, documents sharing common features are indexed into a same corpus, which facilitates programmatic comparisons of one document with another (i.e., it's easier for a computing resource to compare apples to apples).

[0004] One fundamental aspect of indexing is classification, which for some search engines is performed to determine which topics should be associated with which document. Moreover, a document may describe more than one topic. For example, a photo of a celebrity in front of a famous landmark can be associated with both the celebrity and the landmark. Classification is typically implemented with some combination of programmatic resources and human judgment, especially when the former are leveraging machine learning, which requires training in order to be effective. In operation, the programmatic resource (a classifier) takes as input selected features of a document and apply machine learning to determine whether that document can be classified as one describing the topic for which the classifier is assigned to detect. A sampling of the classifier's decisions are evaluated by human operators, which evaluations are applied to assess the effectiveness of the classifier and to calibrate its machine-learning parameters.

[0005] One measure of effectiveness of a classifier is recall, which, for the implementations in which the classifier is tasked with assigning topics to documents, is its ability to classify properly all of documents of a corpus that describes the topic it is assigned to detect.

[0006] To assess recall for classifiers assigned to detect topics prevalent in the corpus, human operators usually have to sample only a small portion of the corpus. For classifiers assigned to detect topics that are sparse (less than 0.1% of the corpus), however, a much larger sample of the corpus is needed since small samples are unlikely to capture documents describing the sparse topic. Time and cost complicit in this later endeavor can be prohibitively expensive, especially when corpus documents number in the billions.

SUMMARY

[0007] In general, a second classifier (C₂) is constructed to be sufficiently independent from the classifier being evaluated (C , where both classifiers are constructed to detect a same sparse topic of a particular corpus of documents (U). Here, independence generally means having different features, those that are not interwoven. Mathematically speaking, perfect independence means orthogonal features in feature space, and perfect dependence means having features on a same line in feature space. The second classifier C₂ is then used as a proxy for the corpus U in the evaluation of C₂'s ability to recall the sparse topic, and instead of having to sample from the entire corpus U to evaluate C₂'s ability to recall, one has to sample only from the set of documents selected by C₂, which is a significantly smaller subset than U, and hence the sample size needed is significantly reduced. Optionally, perfect independence is not implemented as imperfect independence can yield acceptable accuracy in estimating Q's recall. That is, C₂ can optionally be constructed with features that are not independent but only distantly related to features used by Q.

[0008] In general, in one aspect, one or more embodiments relate to a method for document retrieval. The method includes classifying, by a first classifier, a documents to obtain a first set of resulting documents, and classifying, by a second classifier, the documents to obtain a second set of resulting documents. The method further includes estimating a recall metric of the first classifier based on the first set of resulting documents and the second set of resulting documents, revising the first classifier based on the recall metric, and classifying, by the revised classifier, the documents to obtain at least one additional document excluded from the first set of documents. The method further includes sending a communication based on the classification.

[0009] In general, in one aspect, one or more embodiments relate to a system for recall estimation that includes a computing processor, and instructions, that when executed on the computing device processor classify, by a first classifier, a documents to obtain a first set of resulting documents, and classify, by a second classifier, the documents to obtain a second set of resulting documents. The instructions further includes estimate a recall metric of the first classifier based on the first set of resulting documents and the second set of resulting documents, revise the first classifier based on the recall metric, and classify, by the revised classifier, the documents to obtain at least one additional document excluded from the first set of documents. The instructions further includes send a communication based on the classification.

[0010] In general, in one aspect, one or more embodiments relate to a non- transitory computer readable medium comprising computer readable program code that classify, by a first classifier, a documents to obtain a first set of resulting documents, and classify, by a second classifier, the documents to obtain a second set of resulting documents. The computer readable program code further includes estimate a recall metric of the first classifier based on the first set of resulting documents and the second set of resulting documents, revise the first classifier based on the recall metric, and classify, by the revised classifier, the documents to obtain at least one additional document excluded from the first set of documents. The computer readable program code further includes send a communication based on the classification.

[0011] Some implementations of the invention require a significantly smaller sample size than conventional systems to evaluate effectively a classifier's ability to recall sparse topics. Because sampling requires human judgment, the significantly smaller sample size advantageously imposes a significantly reduced time and cost burden that are not prohibitive and hence allows one to perform the described evaluation even for a corpus that has billions of documents.

[0012] Other aspects of the invention will be apparent from the following description and the appended claims.

BRIEF DESCRIPTION OF DRAWINGS

[0013] FIG. 1A, IB, and 2 show diagrams of a system in accordance with one or more embodiments of the invention. [0014] FIGs. 3 and 4 show flowcharts in accordance with one or more embodiments of the invention.

[0015] FIG. 5 shows an example in accordance with one or more embodiments of the invention.

[0016] FIG. 6 shows a computing system in accordance with one or more embodiments of the invention.

DETAILED DESCRIPTION

[0017] Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.

[0018] In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.

[0019] Throughout the application, ordinal numbers (e.g., first, second, third, etc.) may be used as an adjective for an element (i.e., any noun in the application). Lhe use of ordinal numbers is not to imply or create any particular ordering of the elements nor to limit any element to being only a single element unless expressly disclosed, such as by the use of the terms "before", "after", "single", and other such terminology. Rather, the use of ordinal numbers is to distinguish between the elements. By way of an example, a first element is distinct from a second element, and the first element may encompass more than one element and succeed (or precede) the second element in an ordering of elements.

[0020] In general, embodiments of the invention are directed to document retrieval. One or more embodiments have at least two different classifiers that independently classify multiple documents to obtain at least two sets of resulting documents. The classification from at least one classifier is used to generate a recall metric for another classifier. The recall metric is an estimation of a number of documents that belong to the class but are not classified in the class by the classifier. The other classifier may be revised based on the estimated amount.

[0021] In one or more embodiments, the classifiers are applied to sparse document classification in a large corpus. For example, the large corpus may have millions or billions of documents and the number of documents that actually belong to the class may be a tenth of a percent or less of the total corpus. One or more embodiments may use the results of one classifier as a proxy for the corpus of documents in order to generate an estimate of recall for another classifier. In other words, for the purpose of determining the recall metric, the results of the second classifier acts as proxy for the entire corpus of documents. In one or more embodiments, the classifiers satisfy a threshold degree of independence with respect to each other. Independence is discussed below with respect to FIG. 2.

[0022] One or more embodiments of the invention may be applied to a social network. Here, social content is segregated into corpus where those that share common features or attributes are assigned to a same corpus. In this corpus, a document comprises all social content that is attributable to a particular user, and such a document is associated with the user. The corpus hence can consists of multiple documents (hundreds of millions or even billions), each being associated with a particular user. With user consent, an Ads system can infer topics of interest to a user by detecting topics discussed in the user's document and then provide targeted advertisement, the inference being that a user is interested in the topics discussed in her document of social content. Thanks to the recall estimation technique described herein, at least two classifiers for sparse topics can be evaluated and calibrated to recall effectively documents attributable to the sparse topics. For example, one classifier would be over "documents" made up of the social media messages from the account, and the second classifier may be using the users followed by the account/bio text/other signals. Hence, the system can find users interested in the sparse topics, and Ads targeting can be highly granular (i.e., specific to the sparse topic), despite the corpus of documents of social content being very large. Here, sparse topics can be, for example, a particular model of car with particular options, which provides more precise Ads targeting than just the car model.

[0023] Moreover, users sharing similar topics of interest can be grouped together in a same group, and similar curated content can be provided to the users in the group. Curated content relating to news about foreign policy and sports, for example, can be provided to users in a group associated with foreign policy and sports. Curated content relating to news about sports and food can be provided to users of another group, one associated with sports and food. Optionally, user grouping can be implemented to facilitate programmatic curation. Where curated content is modular so that each module covers one topic that may be of interest to users, a programmatic curation system can build curated content for each group of users of a social platform by selecting modules of content that corresponds to the topics of interest associated with the group. Because programmatic curation can be implemented with low latency, the content provided can be fresh content, even breaking news.

[0024] FIG. 1A shows a schematic diagram of a system for a social network in accordance with one or more embodiments of the invention. Specifically, FIG. 1A shows a social network application (100) and a client (120) in accordance with one or more embodiments of the invention. The social network application (100) may also be referred to as a messaging platform in accordance with one or more embodiments of the invention.

[0025] A social network application (100) connects users to other users (i.e., clients) of the social network application (100), exchanges social networking messages between connected users of the social network application (100), and provides an interface for a user to create and view social network messages. In one or more embodiments of the invention, social network messages are broadcast social networking messages that are transmitted to at least a set of users. The users in the set may be self-selected (e.g., followers of the transmitting user) or users that satisfy a certain status with the transmitting user (e.g., belong to a group, friend, family, etc.). The social networking messages may include, for example, a comment from a user on a document, a personal status update, a reference to a document, and/or other information or combination thereof.

[0026] Further, in one or more embodiments of the invention, the social network application (100) includes functionality to receive an original reference from a user for a document, generate a social network reference from the original reference, and transmit the social network reference to other users. Thus, a user may share the document via the social network application (100) by sending a message containing a reference to the document to other users or posting a social network reference to the document. In one or more embodiments of the invention, the original reference is a reference to the location of the published document, such as a uniform resource locator (URL) of a web page. The social network reference is an indirect reference to the location of the published document. Specifically, the social network reference is mapped to the original reference in the discovery repository (148). The social network application may be configured to perform analytics on the engagement of the social network reference and/or shorten the original reference. For example, the social network reference and the original reference may be a hypertext transfer protocol link or another mechanism for referencing the location of a document.

[0027] As shown in FIG. 1A, the social network application (100) has multiple components including a classification engine (150), a machine learning module (not shown), a discovery repository (148), a frontend module (125), a routing module (155), a graph fanout module (130), a delivery module (135), a message repository (140), a connection graph repository (142), a stream repository (144), and an account repository (146). Various components of the social network application (100) may be located on the same device (e.g., a server, mainframe, desktop Personal Computer (PC), laptop, Personal Digital Assistant (PDA), telephone, mobile phone, kiosk, cable box, and any other device) or may be located on separate devices connected by a network (e.g., a local area network (LAN), the Internet, etc.). More than one of each separate component running on a device, as well as any combination of these components, may exist within a given embodiment of the invention. In one or more embodiments of the invention, the social network application (100) is a platform for facilitating real-time communication between one or more entities. For example, the social network application (100) may store millions of accounts of individuals, businesses, and/or other entities (e.g., pseudonym accounts, novelty accounts, etc.). One or more users of each account may use the social network application (100) to send social networking messages to other accounts inside and/or outside of the social network application (100). The social network application (100) may be configured to enable users to communicate in "real-time", i.e., to converse with other users with a minimal delay and to conduct a conversation with one or more other users during simultaneous sessions. In other words, the social network application (100) may allow a user to broadcast social networking messages and may display the social networking messages to one or more other users within a reasonable time frame so as to facilitate a live conversation between the users. Recipients of a social networking message may have a predefined graph relationship with an account of the user broadcasting the social networking message. In one or more embodiments of the invention, the user is not an account holder or is not logged in to an account of the social network application (100). In this case, the social network application (100) may be configured to allow the user to broadcast social networking messages and/or to utilize other functionality of the social network application (100) by associating the user with a temporary account or identifier.

[0029] In one or more embodiments of the invention, the connection graph repository (142) is configured to store one or more connection graphs. FIG. IB shows an example depiction of a connection graph (299) in accordance with one or more embodiments of the invention. As shown in FIG. IB, the connection graph (299) has multiple components, including nodes representing accounts of the social network application (100) (i.e., Account A (200), Account B (202), Account C (204), Account D (206), Account E (208), Account F (210), Account G (212)) and edges connecting the various nodes.

[0030] The connection graph (299) is a data structure representing relationships

(i.e., connections) between one or more accounts. The connection graph (299) represents accounts as nodes and relationships as edges connecting one or more nodes. A relationship may refer to any association between the accounts (e.g., following, friending, subscribing, tracking, liking, tagging, etc.). The edges of the connection graph (299) may be directed and/or undirected based on the type of relationship (e.g., bidirectional, unidirectional), in accordance with various embodiments of the invention.

[0031] Returning to FIG. 1A, in one or more embodiments of the invention, the routing module (155) includes functionality to receive one or more social networking messages and to store the social networking messages in the message repository (140). The routing module (155) may include functionality to assign an identifier to the social networking message and to notify the graph fanout module (130) of a sender of the social networking message.

[0032] In one or more embodiments of the invention, the graph fanout module (130) includes functionality to retrieve graph data from the connection graph repository (142) and to use the graph data to determine which accounts in the social network application (100) should receive the social networking message. The graph data, for example, may reflect which accounts in the social network application are "following" a particular account and are, therefore, subscribed to receive status social networking messages from the particular account.

[0033] In one or more embodiments of the invention, the delivery module (135) includes functionality to receive a list of accounts from the graph fanout module (130) and the message identifier generated by the routing module (155) and to insert the message identifier into stream data associated with each identified account. The delivery module (135) may then store the message list in the stream repository (144). The stream data stored in the stream repository (144) may make up one or more streams associated with one or more accounts of the social network application (100). A stream may be a dynamic list of social networking messages associated with one or more accounts or may reflect any arbitrary organization of social networking messages that is advantageous for the user of an account.

[0034] In one or more embodiments of the invention, the frontend module (125) is a software application or a set of related software applications configured to communicate with external entities (e.g., client (120)). The frontend module (125) may include the application programming interface (API) and/or any number of other components used for communicating with entities outside of the social network application (100). The API may include any number of specifications for making requests from and/or providing data to the social network application (100). For example, a function provided by the API may provide artist/song recommendations to a requesting client (not shown).

[0035] In one or more embodiments of the invention, the frontend module (125) is configured to use one or more of the data repositories (message repository (140), connection graph repository (142), stream repository (144), discovery repository (148), and/or account repository (146)) to define streams for serving social networking messages (i.e., stream data) to a user of the account on the social network application (100). A user may use any client (120) to receive the social networking messages. For example, where the user uses a web-based client to access the social network application (100), an API of the frontend module (125) may be utilized to define one or more streams and/or to serve the stream data to the client for presentation to the user. Similarly, different forms of message delivery may be handled by different modules in the frontend module (125). In one or more embodiments of the invention, the user may specify particular receipt preferences, which are implemented by the frontend module (125).

[0036] In one or more embodiments of the invention, one or more of the data repositories (message repository (140), connection graph repository (142), stream repository (144), account repository (146), discovery repository (148)) is a database and/or storage service residing on one or more servers. For example, one or more of the data repositories may be implemented as a storage service using service-oriented architecture (SOA) and configured to receive requests for data and to provide requested data to other components of the social network application (100). In another example, the message repository (140) may include one or more tables in a distributed database management system (DBMS), a clustered database, a standalone flat file, and/or any storage software residing on one or more physical storage devices. Examples of a storage device may include, but are not limited to, a hard disk drive, a solid state drive, and/or other memory device. Any type of database or storage application may be used, in accordance with various embodiments of the invention.

[0037] In one or more embodiments of the invention, one or more of the data repositories (message repository (140), connection graph repository (142), stream repository (144), account repository (146), discovery repository (148)) is a separate application or set of applications residing on one or more servers external (and communicatively coupled) to the social network application (100). Alternatively, in one or more embodiments of the invention, one or more of the data repositories may be an integrated component of the social network application (100) and/or may reside, either partially or entirely, on one or more common hardware devices (e.g., a server).

[0038] In one or more embodiments of the invention, the discovery repository (148) may store document metadata about documents. The document metadata may include a list of keywords, tracking information, date of the document, a mapping between social media references and original references to documents, and other information.

[0039] In one or more embodiments of the invention, the message repository

(140) includes functionality to store social networking messages and social networking messages metadata. The social networking messages metadata may include an identifier of the originating user of the social networking message, a list of users that received the social networking message, a number of users that received the social networking message, statistics (e.g., a ratio of connected users to the originating user that forward the social networking message versus disconnected users to the originating user that forward the social networking message), time and date at which the social networking message is transmitted, and other information.

[0040] In one or more embodiments of the invention, the connection graph repository (142) may store information about users. Specifically, the connection graph repository may relate user identifiers of a user to user's preferences and history in the social network application (100). For example, the user preferences and history may include language, connections of the user, topics in which the user is interested and other information.

[0041] The classification engine (150) includes functionality to classify documents and determine metrics from classification in accordance with one or more embodiments of the invention. The classification engine (150) is described in further detail in FIG. 2.

[0042] FIG. 2 shows a schematic diagram of a system in accordance with one or more embodiments of the invention. As shown in FIG. 2, the system may include an internet repository (220), a social network repository (208), the classification engine (200), and a communication disperser (218) in accordance with one or more embodiments of the invention.

[0043] An Internet repository (220) is any data repository available via the Internet. In particular, the Internet repository (220) is any storage unit of Internet documents (222). For example, the Internet repository (220) may be a collection of one or more personal computers, servers, mobile devices, or any other computing system or collection thereof that is connected via the Internet. The Internet repository (220) may be controlled by multiple disparate companies and users.

[0044] In one or more embodiments of the invention, the Internet repository (220) includes functionality to store Internet documents (222). Internet documents (222) are documents that are available via the Internet. A document is any type of stored information. For example, a document may be a technical paper, an article, an audio file, a webpage, a file, a tweet, a posting, a user profile, a message posted or communicated via a social media platform, a user's digital footprint or any other stored information. For example, the user's digital footprint may include the collection of documents that the user has accessed, submitted social media messages about, authored, or with which the user has another direct relationship. The document may have document content and metadata. Document content is the content of the document (e.g., the main body of the document) that is presented to users when the document is displayed or played to the user. Metadata is information about the document. For example, the metadata may include the name and size of the document, the author, edit history (e.g., including a user identifier, a time, and tracking information about the edits made for each edit in the edit history), creation time of the document, security features enabled on the document, and other information. For example, the security features may include password protection, public key encryption, or other security features, or a combination of features.

[0045] A document may have one or more attributes. An attribute is a feature of a portion of the document content or metadata. In particular, the attribute may be information explicitly present in the document content or metadata or may be information derived from the document content and/or metadata. For example, an attribute may be a name, a title, a table of contents in the document content, an education level identifier of the targeted reader, a number of retransmissions of the document, size of the document, a uniform resource identifier, or other information.

[0046] The social network repository (208) is any type of storage unit and/or device (e.g., a file system, database, collection of tables, or any other storage mechanism) for storing data. Further, the social network repository (208) may include multiple different storage units and/or devices. The multiple different storage units and/or devices may or may not be of the same type or located at the same physical site. The social network repository (208) may correspond to any of the repositories or any combinations of repositories shown in FIG. 1A.

[0047] The social network repository (208) may store profiles (210) for each user, social network documents (212), and classifier results (e.g., classifier X results (214), classifier Y results (216)). The profiles (210) correspond to information about a user. For example, a profile may include a unique identifier of the user, a name, message preferences for the user, public and private identifier, location information, photo, links to messages, connections to other users and groups, and other profile information. [0048] Social network documents (212) correspond to any document for a social network. For example, the social network documents (212) may include a posting by a user, a technical paper that is stored in the social network repository, a user profile, a status update, or any other document that is local in the social network repository.

[0049] The classifier results (e.g., classifier X results (214), classifier Y results (216)) correspond to results of the classification. For example, the classifier results may include, for each document classified by a classifier, a unique document identifier of the document and a class of the document. In one or more embodiments of the invention, classifier results may be stored with the document or separate from the document. For example, the classifier results may be a relationship between the document and the class of the document. By way of an example, the classifier results may be implemented as tags included in an index of documents in the corpus.

[0050] Continuing with FIG. 2, the system may include a classification engine

(200) with classifier X (202), classifier Y (204), and an analyzer (206). A classifier is hardware, software, firmware, or any combination thereof that includes functionality to classify documents (e.g., social network documents (212) and/or Internet documents (222)). For example, one or both of the classifiers may be a two-way classifier that classifies documents into one of two groups, or a multi-way classifier that classifies documents into one of multiple groups. A two-way classier is a decision rule which, given a document, assigns the document a label from the set {+1,-1}. A +1 label is given to "positive" decisions where the classification decision indicates a belief that the document in question satisfies the search criteria, while a -1 label is given to "negative" decisions. The classifiers make decisions using the attributes of the input document, such as the text in the document, authorship information or connections with other documents. In other words, the classifier performs the classifications based on the attributes having one or more attributes of the document. In one or more embodiments of the invention, the particular attributes used by the classifier is configurable and may be automatically updated.

In equations presented below, various symbols may be used. The following is a table of the symbol and corresponding definition that may appear below.

In one or more embodiments of the invention, the classifiers satisfy a threshold degree of independence with respect to each other. In one or more embodiments of the invention, the amount of in classifiers are independent when the classifiers satisfy the following.

where δ i is the degree of independence between classifier Ci and C₂ as defined by the above equation. In one or more embodiments of the invention, classifier C and C₂ satisfy the threshold degree of independence when 5_t is greater than the threshold degree of independence. For example, the threshold degree of independence may be .95, .99, .90 or any other configurable value. In one or more embodiments of the invention, the above equation may be deemed valid where the classifiers use different and non-overlapping attributes of the documents.

[0053] Another way to characterize the degree of independence between classifiers is based on the documents that are not in the class (i.e., off-topic documents, T c ). The following equation may be used to determine the degree of independence based on the number of documents that are not in the class.

[0054] The analyzer (206) includes functionality to analyze classifier results

(e.g., classifier X results (214), classifier Y results (216)) and perform recall estimation on the classifier results. Recall estimation is the propensity of a classifier to correctly not miss a particular document in the classification. In particular, precision is a measurement of the percentage of documents that are added, by the classifier, to a class and should be in the class (e.g., the number of documents that are correctly in the class by the classifier). Recall estimation, on the other hand, is an estimation of the percentage of documents that are not in the class as classified by the classifier, but should be in the class. For example, in a classification in which a document is either in the class or not in the class, recall estimation is a percentage or other number of documents that are missed being added to the class by the classifier. Recall may be defined using the following equation:

({relevant documents Π {retrieved documents}^')

({relevant documents}) where r is recall percentage, relevant documents are all documents that should be classified into the class, and retrieved documents are documents that are classified into the class.

[0055] Recall is distinct from precision, which is the number of documents that are correctly retrieved by a classifier. In particular, precision may be defined using the following equation:

{{relevant documents] Π {retrieved documents} | P {{retrieved documents}]

where r is recall percentage, relevant documents are all documents that should be classified into the class, and retrieved documents are documents that are classified into the class.

[0056] With a large corpus of documents (e.g., social network documents (212), Internet documents (222)), determining the percentage manually may not be possible. The analyzer (206) includes functionality to execute multiple classifiers (classifier X (202), classifier Y (204)) and perform recall estimation based on the classification results. The analyzer (206) may further include functionality to revise at least one of the classifiers.

[0057] The communication disperser (218) is hardware, software, firmware, or any combination thereof that includes functionality to disburse communications based on the classifier results. The communication disperser (218) may be a part of the social network application and/or an independent product, such as an application used by a third party. For example, the communication disperser may be an advertising server that disperses communications corresponding to advertisements, a recommendation engine that includes functionality to transmit recommendations to a user, or any other disperser of communications. By way of an example, the recommendations may be recommendations of contacts to include in a social network, recommendations of documents, recommendations of profile preferences, or any other recommendation.

[0058] Although FIG. 2 shows two classifiers, additional classifiers may be included without departing from the scope of the invention. Further, although FIG. 2 shows a communication disperser connected to the social network repository, any other module that uses classification results may be used without departing from the scope of the invention.

[0059] While FIGs. 1A, IB, and 2 show a configuration of components, other configurations may be used without departing from the scope of the invention. For example, various components may be combined to create a single component. As another example, the functionality performed by a single component may be performed by two or more components.

[0060] FIGs. 3 and 4 show a flowchart in accordance with one or more embodiments of the invention. While the various steps in this flowchart are presented and described sequentially, one of ordinary skill will appreciate that some or all of the steps may be executed in different orders, may be combined or omitted, and some or all of the steps may be executed in parallel. Furthermore, the steps may be performed actively or passively. For example, some steps may be performed using polling or be interrupt driven in accordance with one or more embodiments of the invention. By way of an example, determination steps may not require a processor to process an instruction unless an interrupt is received to signify that a condition exists in accordance with one or more embodiments of the invention. As another example, determination steps may be performed by performing a test, such as checking a data value to test whether the value is consistent with the tested condition, in accordance with one or more embodiments of the invention.

[0061] FIG. 3 shows a flowchart for classifying documents associated with a user in accordance with one or more embodiments of the invention. In FIG. 3, the documents may be a digital footprint or a collection of documents related to a user.

[0062] In Step 301, the documents are classified to obtain a first set of resulting documents in a class in accordance with one or more embodiments of the invention. In other words, the classifier identifies a document and extracts the attribute values of the document corresponding to the attributes in the attribute set of the classifier. By comparing the attribute values to the attributes of the classes, by which the classifier is trained, the classifier determines to which class the document should be added. The classifier repeats the process for each document. In some embodiments, the classifier operates as a single thread serially. In other embodiments, the classifier may have multiple threads or multiple application instances that classify documents in parallel.

[0063] In Step 303, users corresponding to the documents are identified in accordance with one or more embodiments of the invention. In one or more embodiments of the invention, each document may have one or more users associated with the document. For example, a user may have sent a message having a link to the document, may be an author of the document, may have accessed the document, or may have performed another action with respect to the document. By accessing profile information about the user and/or metadata information for the document, one or more users may be identified that have a relationship to the document. In one or more embodiments of the invention, one or more pre-defined relationship criteria may exist for the user's relationship to the document. For example, rather than any user having a relationship to the document, the pre-defined relationship criteria may specify a pre-defined relationship that the user has in order to be identified in Step 303.

[0064] In Step 305, communications are sent to each of the users corresponding to at least one document that is classified in accordance with one or more embodiments of the invention. In one or more embodiments of the invention, the communication is sent based on the classification. In other words, the communication is sent to each user that has the one or more pre-defined relationships with the document in the class in response to the user having the pre-defined relationship. Communications may be sent via a user interface of an executing application, via a social media message, via electronic mail, or via any other technique.

[0065] In Step 307, documents are classified using a second classifier to obtain a second set of resulting documents that are in the class in accordance with one or more embodiments of the invention. In one or more embodiments of the invention, the second classifier is selected based on a degree of independence to the first classifier. In particular, for an arbitrary document collection, documents tend to be redundant. Namely, the same information is expressed in many different ways throughout the document. Several ways to leverage this redundancy to create conditionally independent classifiers may be used. For example, the second classifier may be trained to classify documents based on a different and non-overlapping set of attributes than the first classifier, based on attribute values of the attributes, based on different sections of content in a document, or based on another characteristic. In other words, in one or more embodiments of the invention, the independence between the first classifier and the second classifier may not be perfect. For example, some dependency may exist between the attributes of the classifier. However, in accordance with one or more embodiments of the invention, the amount of independence between the classifiers satisfies a threshold. Constructing example sets of classifiers are discussed below. In one or more embodiments of the invention, classification of documents by the second classifier may be performed using a same or similar technique as discussed above with reference to Step 301 of FIG. 3.

[0066] In Step 309, a recall metric is estimated based on the first set of resulting documents and the second set of resulting documents in accordance with one or more embodiments of the invention. In particular, one or more embodiments of the invention perform recall estimation on the first classifier using the results of the second classifier. Recall estimation may be performed using the following equation for two independent classifiers (e.g., classifier 1, classifier 2) performing a two-way classification:

In the above equation, r is recall estimation, Ai is the set of documents classified in the class by classifier 1, A₂ is the set of documents classified in the class by classifier 2, A₁₂ is the set of documents classified in the class by both classifiers, pi is the precision of classifier 1, p₂ is the precision of classifier 2, and U is the set of all documents in the collection. The above equation may be modified for additional classifiers, multi-way classifiers, or classifiers not having as much independence.

[0067] In Step 311, a determination is made whether the recall metric satisfies a threshold in accordance with one or more embodiments of the invention. If the recall metric satisfies the threshold, then the flow may proceed to end. In other words, the communication may be deemed to be sent to all targeted users. In one or more embodiments of the invention, if the amount does not satisfy the threshold, the flow may proceed to Step 313. [0068] In Step 313, the first classifier is revised based on the amount to obtain a revised classifier in accordance with one or more embodiments of the invention. Revising the first classifier may include adding the attributes of the second classifier to the first classifier, retraining the first classifier using additional training samples or performing another action.

[0069] In Step 315, the documents are classified using the revised classifier in accordance with one or more embodiments of the invention. In one or more embodiments of the invention, classifying documents by the revised classifier may be performed using a same or similar technique as discussed above with reference to Step 301 of FIG. 3. However, rather than the original classifier classifying documents, a revised classifier classifies the documents. For example, the revised classifier may use the additional attributes or attribute values.

[0070] In Step 317, any additional users are identified based on the classified documents in accordance with one or more embodiments of the invention. Identifying additional users may be performed using a same or similar technique as discussed above with reference to Step 303 of FIG. 3. In other words, using the revised classification results, additional users are identified.

[0071] In Step 319, the communication is sent to the additional users in accordance with one or more embodiments of the invention. Sending the communication may be performed using a same or similar technique as discussed above with reference to Step 305 of FIG. 3. In one or more embodiments of the invention, rather than sending the communication to all users, the communication is sent only to the additional users added by revising the classifier.

[0072] FIG. 4 shows a flowchart for classifying documents, such as for indexing purposes. In FIG. 4, the corpus of document may be virtually any type of document, such as the documents discussed above with reference to FIG. 2. In Step 401, the documents are classified to obtain, in an index, a first set of resulting documents in a class in accordance with one or more embodiments of the invention. In other words, the classifier identifies a document and extracts the attribute values of the document corresponding to the attributes in the attribute set of the classifier. By comparing the attribute values to the attributes of the classes, by which the classifier is trained, the classifier determines to which class the document should be added. The classifier may then add the document to an index accordingly. The classifier repeats the process for each document. In some embodiments, the classifier operates as a single thread serially. In other embodiments, the classifier may have multiple threads or multiple application instances that classify documents in parallel.

[0073] In Step 403, documents are classified using a second classifier to obtain, in an index, a second set of resulting documents that are in the class in accordance with one or more embodiments of the invention. Selecting the second classifier may be performed using the technique described above with reference to Step 307 of FIG. 3. In one or more embodiments of the invention, classification of documents by the second classifier may be performed using a same or similar technique as discussed above with reference to Step 401 of FIG. 4.

[0074] In Step 405, a recall metric is estimated based on the first set of resulting documents and the second set of resulting documents in accordance with one or more embodiments of the invention. In Step 407, a determination is made whether the recall metric satisfies a threshold in accordance with one or more embodiments of the invention. If the recall metric satisfies the threshold, then the flow may proceed to end. In other words, the communication may be deemed to be sent to all targeted users. In one or more embodiments of the invention, if the amount does not satisfy the threshold, the flow may proceed to Step 409. In Step 413, the first classifier is revised based on the amount to obtain a revised classifier in accordance with one or more embodiments of the invention. In Step 415, the documents are classified using the revised classifier in accordance with one or more embodiments of the invention. Performing Steps 405, 407, 409, and 411 may be respectively performed using the technique described above with reference to Step 309, 311, 313, and 315 of FIG. 3.

[0075] In Step 413, a request for communication is received from a user in accordance with one or more embodiments of the invention. For example, a user may request a set of documents that correspond to a class. In other words, the user may request documents that are indexed according to a particular topic. By way of another example, the user may request that analysis is performed based on the classes associated with documents. For example, the analysis may be the percentages and types of documents that are in a class.

[0076] In Step 417, classification results are obtained in response to the request.

In one or more embodiments of the invention, the analysis requested by the user is performed, the documents and/or identifiers of documents that are indexed with the class are obtained, or another set of classification results are obtained.

[0077] In Step 419, the communication is sent to the user in accordance with one or more embodiments of the invention. In other words, the communication is generated with the requested information and sent to the user. Sending the communication may be performed using various techniques, such as via an electronic mail message, in a web page, as a page in a web application, or using any other techniques. In one or more embodiments of the invention, sending a communication may be performed by partitioning a message into packets, and sending the packets on a network that connects various physical devices.

[0078] The following examples of creating independent classifiers are for explanatory purposes only and not intended to limit the scope of the invention. In the following example, consider a pair of status keyword filters that tend to exhibit the conditional independence properties of interest. A keyword filter is a special type of classifier for which a document is marked as on-topic if any of the words in the filter appear in the document. A seed term to be a single word or phrase that captures the topic of interest in the broadest generality possible. Some examples of seed terms would be sports, Mars or Obama. The classifiers are defined as follows: the first classifier in the pair, C_seed, filters on the terms seed; #seed (in the case of social media messages) while the second filter, C_neighbors_> filters on similar neighbors of the seed terms. A number of different scores may be used for measuring overlap. Letting A_seed and A_y denote the documents found by the seed term filter and a candidate keyword _y, respectively, one such similarity score is the Jaccard similarity using the equation below.

The following is an example set of seed terms and high Jaccard similarity neighbors. A first example of seed terms are [apple;#apple] and the similarity neighbors are [#ios6, #ipad3, #iphone, hack,macintosh, iphones, #siri, ios, macbook, icloud, ipad, Samsung, #ipodtouch, 4s, itunes, cydia, cider, #gadget, #tech, #tablet, app, connector, #mac,...]. A second example of seed terms are [mars;#mars] and the similarity neighbors are [rover, nasa, #curiosity, #curiousity, image, mission, surface, #curiosityrover, bruno, milky, budget, @marscuriosity, gale, crater, orbiting, successfully, lands, landing, breathtaking . . . ]. A third example of seed terms are [obama;#obama] and the similarity neighbors are [@barackobama, barack, bush, #mitt2012, #obama2012, obamas, #dems, #gop, #military, romney, #idontsupportobama, potus, #president, administration, pres, #politics, voting . . . ]. A fourth example of seed terms are [Olympics ;#olympics] and the similarity neighbors are [medalist, gold, london, gb, kirani, gymnast, kate, sprinter, winning, won, #boxing, soccer, watch, watching, #usa, #teamgb, #canada, #london, javelin, nbc, match, 2012, 400m . . . ].

[0079] Another similarity score is the asymmetric overlap measure, using the following equation.

PMI, Tversky similarity, or another measure of similarity may be used. Regardless of which measurement is used, the top scoring neighbors (leaving out seed terms) are included in the filter defining C_nei_gllbors. Although some terms appearing in the list of neighbors may not be not on-topic, both filters have overall precisions that are 'reasonable'. In some embodiments, the precision in the denominators of the above equations are only required not to be so small as to make the computation unstable in addition to the conditional independence assumptions.

[0080] The seed terms, for well-behaving topics, might occur independently of other terms, given an on-topic document. For instance, for the topic Olympics, one would expect that the occurrences of the terms 100m, swimming and javelin to occur independently of the seed term Olympics, given that a document is drawn from the set of on-topic documents.

[0081] By way of another example, threshold-based classifiers and a social media classifier pair satisfying the conditional independence property may be considered. In the second example, classifiers induced over a training corpus that produce a real-valued score when evaluated over an out-of-sample datum are considered. For scores above a set threshold, a document is labeled as on- topic. Many commonly used machine-learning techniques fall into this category, including logistic regression, SVM or Naive Bayes. Additionally, the universe of documents (U) to be the set of documents which also contain links to other documents. In the case of social media, the classifier may be limited to social media messages which also include links to documents on the web. Using the threshold based classifiers, a conditionally independent pair may be constructed as follows: one classifier is trained and runs entirely on social media message text while the other is based entirely on the web documents that the social media message links to. In the training, the processes generating each description/web-document pair are independent of each other. In some embodiments, only the conditional independence is held in an approximate sense.

[0082] In the example, as the thresholds for classifiers CI and C2 are lowered, the conditional independence assumptions become increasingly valid. The tradeoff for choosing thresholds at a lower level, however, is that obtaining accurate estimates of the respective precisions pi and p2 becomes more expensive.

[0083] By way of another example, returning to the original universe of documents U, a classifier pair which allows the first classifier, CI, to be either filter-based or threshold-based is considered. The second classifier, C2, is based entirely on social signals. The social signals may be composed of authorship (an author who is primarily known for technology articles would have all of his/her social media messages classified by C2 as on-topic for the topic technology). On a social media platform, @mentions may provide additional social signals. For instance, knowing that a social media message has been sent as a reply to @barackobama may provide a signal that the social media message is about politics. Due to the fact that the signals used by each of these classifiers have independent origins, conditional independence may hold here as well.

[0084] In one or more embodiments of the invention, whatever classifier is being used, two new classifiers may be constructed using the below. Having the recalls of each of those classifiers, an estimate on the total number of positives, n_est, may be formed. That is, the estimate n_est may be computed as follows:

where is the recall of a conditionally independent classifier, p_t is the corresponding precision and Al is the set of documents retrieved by this classifier. Using the above estimate in hand, an estimate of the recall for the classifier of interest may be obtained, since the precision may be inexpensively computed as discussed above. The estimate for the recall of the classifier of interest (i.e., n_new-est) may be obtained using the following equations.

[0085] The following example is for explanatory purposes only and not intended to limit the scope of the invention. In the following example, consider the scenario in which classification is performed to classify which users would be appropriate to target for a particular advertisement campaign. The advertisement campaign is meant to reach a small number of users, such as the targeting based on niche products. One classifier may be based, for example, on the corpus of social media messages produced, while another may be based on the users the account follows. The two classifiers may be used together to determine the recall metric of each classifier (e.g., the fraction of all appropriate targeting candidates each classifier was able to find).

[0086] By way of a more concrete example, consider the scenario in which the advertisement disperser is to transmit targeted advertisement for a luxury car. Specifically, the advertisement disperser targets affluent users. In the example, classifier X may classify based on the user profiles. Namely, classifier X searches for particular attributes in the user profiles that may be an indication of how affluent each user is. Classifier X stores the user identifiers of identified users that have at least a particular income or net worth level in classifier X results. Classifier Y searches for particular keywords in the user profiles that may be an indication of how affluent each user is and stores identified users that have at least the particular income or net worth level in classifier Y results.

[0087] The classifier X results may be used by the analysis module with classifier Y results to estimate the recall of classifier Y results and, thus, the percentage of target users missed by classifier Y. In the example, if the percentage of recall is 30%, then the advertisement disperser may miss 70% of users that may be interested in the luxury car. Because classifier Y missed more than a threshold percentage of targeted users, the estimated recall may be used to revise classifier Y creating a revised classifier. For example, classifier Y may be retrained with an additional training set, may have additional keywords added, may have additional or alternative types of documents added, and/or may be combined with classifier X. If classifier X and classifier Y are combined to create the revised classifier, the recall and precision may be estimated based on the existing classifier X results and classifier Y results in accordance with one or more embodiments of the invention.

[0088] The quality of the estimated recall may be dependent on the degree of independence between classifier X and classifier Y and/or the precision of the classifier X results and the classifier Y results. In other words, the quality of the estimate may be calculated by calculating the degree of independence between the classifier and/or results and/or by estimating the precision. With a greater degree of independence (i.e., more independent) and/or greater precision, the quality increases. [0089] One or more embodiments of the invention may be applied to the social media analysis as follows. Creation of social media updates (e.g., social media messages, status posts) and the volume and/or availability of the updates has resulted in usage for a variety of predictive tasks. For example, the predictive tasks may include analyzing the sentiment around products and brands (and resulting potential sales), inferring the audience for sporting events, estimating popularity of politicians, etc. Thus, one or more embodiments may define independent classifiers that classify social media streams into topics. One or more embodiments may then estimate recall using the technique described above in order to determine the performance of one or both classifiers. Based on the estimated recall, a classifier may be updated and used reclassified the social media updates. The results of the classification may be used to generate metrics describing the social media updates in accordance with the above predictive tasks. The social media metrics may be described in a communication to one or more users, such as another company that uses the results to perform operations.

[0090] FIG. 5 shows an example, in accordance with one or more embodiments of the invention. In particular, FIG. 5 shows an example Venn diagram. In FIG. 5, the region having the set of all documents is labeled as region U (500) and is the entire rectangle. The region having the set of documents actually belonging to the class is the circle labeled region T (502). The region having the set of documents classified by classifier Ci as belonging to the class is the circle labeled region A_t (504). The region having the set of documents classified by classifier C₂ as belonging to the class is the circle labeled region A₂ (506). If Al and A2 satisfy constraints, the recall τι of C₁ may be estimated using the following equation:

In other words, the above equation and FIG. 5 shows the amount to which A₂, the classifier CI is able to find. If CI and C2 are sufficiently independent, the estimate using the above equation reliably conveys the recall r^ Rather than independence over random draws from the entire set of documents, independence over the on-topic and on-topic subsets of U which makes the intuition of the above equation valid in accordance with one or more embodiments of the invention. Embodiments of the invention may be implemented on a computing system. Any combination of mobile, desktop, server, embedded, or other types of hardware may be used. For example, as shown in FIG. 6, the computing system (600) may include one or more computer processor(s) (602), associated memory (604) (e.g., random access memory (RAM), cache memory, flash memory, etc.), one or more storage device(s) (606) {e.g., a hard disk, an optical drive such as a compact disk (CD) drive or digital versatile disk (DVD) drive, a flash memory stick, etc.), and numerous other elements and functionalities. The computer processor(s) (602) may be an integrated circuit for processing instructions. For example, the computer processor(s) may be one or more cores, or micro-cores of a processor. The computing system (600) may also include one or more input device(s) (610), such as a touchscreen, keyboard, mouse, microphone, touchpad, electronic pen, or any other type of input device. Further, the computing system (600) may include one or more output device(s) (608), such as a screen {e.g., a liquid crystal display (LCD), a plasma display, touchscreen, cathode ray tube (CRT) monitor, projector, or other display device), a printer, external storage, or any other output device. One or more of the output device(s) may be the same or different from the input device(s). The computing system (600) may be connected to a network (612) {e.g., a local area network (LAN), a wide area network (WAN) such as the Internet, mobile network, or any other type of network) via a network interface connection (not shown). The input and output device(s) may be locally or remotely {e.g., via the network (612)) connected to the computer processor(s) (602), memory (604), and storage device(s) (606). Many different types of computing systems exist, and the aforementioned input and output device(s) may take other forms.

[0092] Software instructions in the form of computer readable program code to perform embodiments of the invention may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium. Specifically, the software instructions may correspond to computer readable program code that when executed by a processor(s), is configured to perform embodiments of the invention.

[0093] Further, one or more elements of the aforementioned computing system (600) may be located at a remote location and connected to the other elements over a network (612). Further, embodiments of the invention may be implemented on a distributed system having a plurality of nodes, where each portion of the invention may be located on a different node within the distributed system. In one embodiment of the invention, the node corresponds to a distinct computing device. Alternatively, the node may correspond to a computer processor with associated physical memory. The node may alternatively correspond to a computer processor or micro-core of a computer processor with shared memory and/or resources.

[0094] While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments may be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.

Claims

CLAIMS What is claimed is:

1. A method for document retrieval comprising:

classifying, by a first classifier, a plurality of documents to obtain a first set of resulting documents, wherein the first set of resulting documents correspond to a plurality of users;

sending a communication to each of the plurality of users based on each of the plurality of users corresponding to at least one document in the first set of documents;

classifying, by a second classifier, the plurality of documents to obtain a second set of resulting documents;

estimating, by a computer processor, an amount of the plurality of documents excluded from identification by the first classifier based on the first set of resulting documents and the second set of resulting documents;

revising the first classifier based on the amount to obtain a revised classifier; classifying, by the revised classifier, the plurality of documents to obtain at least one additional document excluded from the first set of documents; and

sending the communication to at least one additional user excluded from the plurality of users based on the at least one additional user corresponding to the at least one additional document.

2. The method of claim 1, wherein the first classifier classifies the plurality of documents based on a first set of constraints, wherein the second classifier classifies a second plurality of documents based on a second set of constraints, and wherein the first set of constraints is within a threshold degree of independence to the second set of constraints.

3. The method of claim 2, wherein the first set of constraints comprises a first keyword, and the second set of constraints comprises a second keyword.

4. The method of claim 1, wherein the communication is an advertisement.

5. The method of claim 1, wherein the plurality of documents comprises a plurality of social media messages.

6. The method of claim 5, wherein the first classifier classifies based on at least one selected from a group consisting of content of each social media message, authorship of each social media message, and connections with a second social media messages.

7. The method of claim 1, wherein the first classifier classifies the plurality of documents based on a first portion of each document, wherein the second classifier classifies the second plurality of documents based on a second portion of each document, and wherein the first portion of each document and the second portion of each document are non-overlapping.

8. A system for recall estimation comprising:

a computing processor;

instructions, that when executed on the computing device processor:

classifies, by a first classifier, a plurality of documents to obtain a first set of resulting documents, wherein the first set of resulting documents correspond to a plurality of users;

sends a communication to each of the plurality of users based on each of the plurality of users corresponding to at least one document in the first set of documents;

classifies, by a second classifier, the plurality of documents to obtain a second set of resulting documents;

estimates, by a computer processor, an amount of the plurality of documents excluded from identification by the first classifier based on the first set of resulting documents and the second set of resulting documents; revises the first classifier based on the amount to obtain a revised classifier;

classifies, by the revised classifier, the plurality of documents to obtain at least one additional document excluded from the first set of documents; and

sends the communication to at least one additional user excluded from the plurality of users based on the at least one additional user corresponding to the at least one additional document.

9. The system of claim 8, wherein the first classifier classifies the plurality of documents based on a first set of constraints, wherein the second classifier classifies a second plurality of documents based on a second set of constraints, and wherein the first set of constraints is within a threshold degree of independence to the second set of constraints.

10. The system of claim 9, wherein the first set of constraints comprises a first keyword, and the second set of constraints comprises a second keyword.

11. The system of claim 8, wherein the communication is an advertisement.

12. The system of claim 8, wherein the plurality of documents comprises a plurality of social media messages.

13. The system of claim 12, wherein the first classifier classifies based on at least one selected from a group consisting of content of each social media message, authorship of each social media message, and connections with a second social media messages.

14. The system of claim 8, wherein the first classifier classifies the plurality of documents based on a first portion of each document, wherein the second classifier classifies the second plurality of documents based on a second portion of each document, and wherein the first portion of each document and the second portion of each document are non-overlapping.

5. A non-transitory computer readable medium comprising computer readable program code for:

16. The non-transitory computer readable medium of claim 15, wherein the first classifier classifies the plurality of documents based on a first set of constraints, wherein the second classifier classifies a second plurality of documents based on a second set of constraints, and wherein the first set of constraints is within a threshold degree of independence to the second set of constraints.

17. The non-transitory computer readable medium of claim 16, wherein the first set of constraints comprises a first keyword, and the second set of constraints comprises a second keyword.

18. The non-transitory computer readable medium of claim 15, wherein the plurality of documents comprises a plurality of social media messages.

19. The non- transitory computer readable medium of claim 19, wherein the first classifier classifies based on at least one selected from a group consisting of content of each social media message, authorship of each social media message, and connections with a second social media messages.

20. The non-transitory computer readable medium of claim 15, wherein the first classifier classifies the plurality of documents based on a first portion of each document, wherein the second classifier classifies the second plurality of documents based on a second portion of each document, and wherein the first portion of each document and the second portion of each document are non- overlapping.