WO2015065290A1

WO2015065290A1 - Micropost brand monitoring

Info

Publication number: WO2015065290A1
Application number: PCT/SG2014/000508
Authority: WO
Inventors: Tat-Seng CHUA; Hadi AMIRIEBRAHIMABADI; Yan Chen; Anqi CUI
Original assignee: National University of Singapore
Current assignee: National University of Singapore
Priority date: 2013-10-30
Filing date: 2014-10-30
Publication date: 2015-05-07
Anticipated expiration: 2016-04-30

Abstract

Monitoring of social media and microblogs for information relevant to company brands and associated information by the use of web crawlers. A unified framework of fixed and dynamic keywords, known accounts, key users and friend lists are used to identify relevant microposts, tweets, on a particular organisation of interest. This organisational information is classified by relevancy, using learning algorithms, and emerging and evolving topics are identified.

Description

MICROPOST BRAND MONITORING

Cross-reference to Related Applications

[0001] The present application claims the benefit of the Singapore patent application No. 201308079-1 Filed on 30 October 2013, the entire contents of which are incorporated herein by reference for all purposes,

Technical Field

[0002] Embodiments relate generally to information determination devices and information determination methods.

Background

[0003] Social media portals like Twitter, Facebook, and various forum sites contain the everyday thoughts, opinions, and experiences of their online users. Parts of these user generated contents (UGCs) reflect and reveal information about organizations such as the companies, banks, government organizations, and universities etc. Thus, there may be a need for analyzing this data.

Summary

[0004] According to various embodiments, an information determination device may be provided. The infoimation detennination device may include: an account crawler configured to determine data from at least one pre- determined user account; and an information determiner configured to determine information related to a pre-detemiined organization based on the data.

[0005] According to various embodiments, an information determination method may be provided. The information determination method may include: determining data from at least one pre-determined user account; and determining information related to a pre-detemvined organization based on the data.

Brief Description of the Drawings

[0006] In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments are described with reference to the following drawings, in which:

FIG. 1 A and FIG. IB show information determination devices in accordance with various embodiments;

FIG. 1C shows a flow diagram illustrating an information determination method according to various embodiments;

FIG. 2 shows an illustration of websites;

FIG. 3 shows an illustration of a power law correlation;

FIG. 4 shows an illustration of an architecture according to various embodiments; FIG. 5 shows an illustration of learning evolving and emerging topics;

FIG. 6 shows an illustration of a distribution of relevant tweets; FIG. 7, FIG. 8, and FIG. 9 show illustrations of an effect of learning parameters; and

FIG. 10 shows an illustration of an effect of the temporal continuity constraint.

Description

[0007] Embodiments described below in context of the devices are analogously valid for the respective methods, and vice versa. Furthermore, it will be understood that the embodiments described below may be combined, for example, a part of one embodiment may be combined with a pari of another embodiment.

[0008] In this context, the infonnation determination device as described in this description may include a memory which is for example used in the processing carried out in the infonnation determination device. A memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a nonvolatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).

[0009] In an embodiment, a "circuit" may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a "circuit" may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor (e.g. a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A "circuit" may also be a processor executing software, e.g. any kind of computer program, e.g. a computer program using a virtual machine code such as e.g. Java, Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a "circuit" in accordance with an alternative embodiment.

[0010] According to various embodiments, devices and methods may be provided for online discovery of events and topics for organizations from social media. Given an organization, a unified framework may be provided to address two issues that have not been tackled to date: (a) crawling more representative distribution of relevant contents and (b) discriminating relevant from irrelevant content for the organization. The current organization or brand monitoring systems use a fixed set of known keywords to crawl micro-posts from social media. This popular strategy results in (a) many missing relevant micro-posts, and (b) many irrelevant micro-posts. The first issue is due to the dynamic nature of the social media contents, while the latter issue is due to the polysemy problem in which the acronyms of organizations are often shared by many entities. For example, NUS is shared between National University of Singapore, National Union of Students and Nu-Skin™ company.

[0011] According to various embodiments, a unified framework may be provided to address the above issues. This framework may utilize multiple aspects of organizations including fixed keywords, known accounts and automatically identified key-users to crawl more relevant data about organizations from social media. Moreover, it effectively may employ content and user information to address the polysemy problem for organizations. Given the automatically identified relevant micro-posts for an organization, an adaptation of online sparse coding algorithms to efficiently learn the topics through time may be provided. Comprehensive experiments show promising results for three different organizations using streaming data obtained from Twitter.

[0012] According to various embodiments, devices and method for discovering topics related to a given organization by automatic identification of the relevant micro-posts (e.g. tweets) and users through time in the context of Microblogs may be provided.

[0013] According to various embodiments, devices and method for social media analytics, brand monitoring, organization monitoring, topic detection, and/ or topic evolution may be provided.

[0014] According to various embodiments, devices and methods for mining the sense of organizations in social media may be provided. Devices and methods according to various embodiments may be referred to as "OrgSense".

[0015] Live tweet streams have been previously used for topic mining and event detection in general contexts. Also, models of burst and hot topic detection have been developed, from automation to temporal patterns.

[0016] Existing approaches use keywords and hashtags (Hashtags are keywords attached to the # symbol to categorize tweets based on their context) to crawl data. While keyword based approaches work well on mining tweets about specific topics, they are restricted to a set of keywords that are maintained manually. Fixed keywords fail to discover a large fraction of relevant information simply due to missing newly-introduced terms within topics and micro-posts without known keywords. Furthermore, fixed keywords may represent several different entities and results in many iixelevant micro- posts. [0017] Mining evolving and emerging topics in the social media content has become a hot research topic recently. An approach may be used to identify emergent keywords and to utilize them to find emerging topics, A term may be defined as emergent if it frequently occurs in the current time but not in the previous times. In another approach, it may be focused on individual users and an LDA-based (Linear discriminant analysis) approach (called Temporal-LDA) may be used that learns topic transition in a sequence of tweets posted by the same user and use it to predict the future distribution of the user's tweets. In yet a further approach, the evolution of topics may be tracked through time. It may be shown that a sparse coding algorithm with the non-negativity constraint is effective for topic modeling in the social media context. A continuity constraint may be introduced. Transient crowd, a short-lived collection of people who directly communicate with each other through social messages like reply and mention of Twitter, may be mined.

[0018] In a definition according to various embodiments, users may be part of the same community as long as they share interest on the same topic (such communities may be referred to as interest communities). Commonly used algorithms may not be effective to mine such interest communities as there may not be any direct conversation between the users in these communities.

[0019] While commonly used approaches may be effective in mining topics in general, they are not designed for addressing the lack of representative data or polysemy issues for entities like organizations. According to various embodiments, an effective framework may be provided to elicit representative amount of data about organizations and dealing with the polysemy issue for organizations in social media. Various embodiments may be focused on the above issues in Microblogs. [0020] FIG. 1A shows an information determination device 100 according to various embodiments. The information determination device 100 may include an account crawler 102 configured to determine data from at least one pre-determined user account. The information determination device 100 may further include an information determiner 104 configured to determine information related to a pre-determined organization based on the data. The account crawler 102 and the information determiner 104 may be connected via a connection 106 (or a plurality of separate connections), for example an electrical or optical connection, for example any kind of cable or bus.

[0021] It will be understood that a crawler, or a Web crawler, as referred to herein, may be software that downloads data automatically from a network, for example from the Internet. A crawler may systematically visit web pages (for example with given configuration and strategy designed by programmers) and download the data of the web pages. The crawler may perform these repetitive tasks at a much higher rate than doing manually. According to various embodiments, the streaming API (application programming interface) of Twitter may be used to crawl tweets. A topic miner may be software that implements a method to discover human-understandable topics from the texts.

[0022] In other words, according to various embodiments, an information determination device may be provided which detennines information related to a predetermined organization based on data which are determined from one or more predetermined user accounts. [0023] According to various embodiments, the account crawler 102 may include or may be or may be included in a known account crawler configured to determine the data from at least one account for the organization.

[0024] According to various embodiments, the account crawler 102 may include or may be or may be included in a key-user

configured to determine the data from at least one account of a key user.

[0025] FIG. IB shows an information determination device 108 according to various embodiments. The information determination device 108 may, similar to the information determination device 100 of FIG. 1A, include an account crawler 102 configured to determine data from at least one pre-determined user account. The information determination device 108 may, similar to the information determination device 100 of FIG. 1A, further include an information determiner 104 configured to detennine information related to a pre-determined organization based on the data. The information determination device 108 may further include a keyword crawler 1 10, like will be described in more detail below. The information determination device 108 may further include a user friend list crawler 112, like will be desciibed in more detail below. The information detenni nation device 108 may further include a classifier 1 14, like will be described in more detail below. The information determination device 108 may further include a topic miner 116, like will be described in more detail below. The information determination device 108 may further include an optimization problem solver 1 18, like will be described in more detail below. The infonnation determination device 108 may further include a trivial topic purging circuit 120, like will be described in more detail below. The account crawler 102, the information determiner 104 the keyword crawler 110, the user friend list crawler 1 12, the classifier 114, the topic miner 1 16, the optimization problem solver 118, and the trivial topic purging circuit 120 may be connected via a connection 122 (or a plurality of separate connections), for example an electrical or optical connection, for example any kind of cable or bus.

[0026] According to various embodiments, the keyword crawler 1 10 may be configured to determine further data based on at least one pre-determined keyword. The information determiner 104 may further be configured to determine the information further based on the further data.

[0027] According to various embodiments, the keyword crawler 110 may include or may be or may be included in a fixed keyword crawler configured to determine the further data based on at least one fixed keyword.

[0028] According to various embodiments, the keyword crawler 110 may include or may be or may be included in a dynamic keyword crawler configured to determine the further data based on at least one dynamic keyword. At least one dynamic keyword may be changed based on processing of the information determination device 108.

[0029] According to various embodiments, the user friend list crawler 112 may be configured to determine a user graph of users in a social relationship with at least one of the organization or each other.

[0030] According to various embodiments, the account crawler 102 may further be configured to determine the at least one pre-determined user account based on the user graph. [0031] According to various embodiments, the classifier 1 14 may be configured to classify the data into data relevant to the pre-determined organization and data irrelevant to the pre-determined organization.

[0032] According to various embodiments, the information determiner 104 may further be configured to determine the information based on the data relevant to the predetermined organization,

[0033] According to various embodiments, the topic miner 1 16 may be configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about the pre-determined organization.

[0034] According to various embodiments, the topic miner 1 16 may be configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about pre-determined organization. The dynamic keyword crawler may further be configured to determine the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.

[0035] According to various embodiments, the account crawler 102 may further be configured to determine the at least one pre-determined user account based on the data relevant to the pre-determined organization.

[0036] According to various embodiments, the classifier 114 may be configured to classify the data based on learning.

[0037] According to various embodiments, the classifier 1 14 may be configured to classify the data based on a support vector machine. [00381 According to various embodiments, the topic miner 116 may be configured to detect whether the determined information related to the pre-determined organization is related to the evolving topic or to the emerging topic based on learning.

[0039] According to various embodiments, the topic miner 116 may be configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem.

[0040] According to various embodiments, the optimization problem may include a temporal continuity constraint.

[0041] According to various embodiments, the optimization problem may include a sparse matching constraint.

[0042] According to various embodiments, the optimization problem solver 118 may be configured to solve the optimization problem based on a least angle regression.

[0043] According to various embodiments, the trivial topic purging circuit 120 may be configured to remove old topics from evolving topics and emerging topics.

[0044] FIG. 1C shows a flow diagram 124 illustrating an information determination method. In 126, data may be deteixnined from at least one pre-determined user account. In 128, information related to a pre-determined organization may be determined based on the data.

[0045] According to various embodiments, the method may further include determining the data from at least one account for the organization.

[0046] According to various embodiments, the method may further include determining the data from at least one account of a key user. 10047] According to various embodiments, the method may further include: detennining further data based on at least one pre-detennined keyword; and determining the infonnation further based on the further data.

[0048] According to various embodiments, the method may further include determining the further data based on at least one fixed keyword.

[0049] According to various embodiments, the method may further include detennining the further data based on at least one dynamic keyword, wherein the at least one dynamic keyword is changed based on processing of the information determination method.

[0050] According to various embodiments, the method may further include detennining a user graph of users in a social relationship with at least one of the organization or each other.

[0051] According to various embodiments, the method may further include determining the at least one pre-determined user account based on the user graph.

[0052] According to various embodiments, the method may further include classifying the data into data relevant to the pre-determined organization and data irrelevant to the pre-determined organization.

[0053] According to various embodiments, the method may further include determining the information based on the data relevant to the pre-determined organization.

[0054] According to various embodiments, the method may further include detecting whether the detennined infonnation related to the pre-detennined organization is related to an evolving topic or to an emerging topic. [0055] According to various embodiments, the method may further include: detecting whether the determined information related to the pre-determined organization is related lo an evolving topic or to an emerging topic; and detennining the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.

[0056] According to various embodiments, the method may further include detennining the at least one pre-determined user account based on the data relevant to the pre-determined organization.

[0057] According to various embodiments, the method may further include classifying the d ta based on learning.

[0058] According to various embodiments, the method may further include classifying the data based on a support vector machine.

[0059] According to various embodiments, the method may further include detecting whether the determined infonnation related to the pre-detennined organization is related to the evolving topic or to the emerging topic based on learning.

[0060] According to various embodiments, the method may further include detecting whether the determined information related to the pre-detennined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem.

[0061] According to various embodiments, the optimization problem may include a temporal continuity constraint.

[0062] According to various embodiments, the optimization problem may include a sparse matching constraint. [0063] According to various embodiments, the method may further include solving the optimization problem based on a least angle regression.

[0064] According to various embodiments, the method may further include removing old topics from evolving topics and emerging topics.

[0065] Social media portals like Twitter, Facehook, and various forum sites contain the everyday thoughts, opinions, and experiences of their online users. Parts of these user generated contents (UGCs) reflect and reveal information about organizations such as the companies, banks, government organizations, and universities etc. According to various embodiments, devices and methods may be provided for analyzing this data.

[0066] FIG. 2 shows an illustration 200 of websites, for example Optus Online Department on Twitter, see Optus account on Twitter at https://twitter.com Optus. Optus is the second largest telecommunications company in Australia. As an example, FIG. 2 shows the verified Twitter account of the Optus Telecommunication Company. The biography of this account and its activity level indicate that user-centric businesses are spending substantial resources to hear the voice of their customers. In fact, it may be invaluable for such organizations to keep track of their live feedback to discover actionable insights from social media and provide better (personalized) services to their users. According to various embodiments, sophisticated methods and devices may be provided to discover such topics about a given organization from social media contents.

[0067] There may be three key challenges in mining topics for organizations from social media, like will be described in the following:

[0068] A first key challenge may be effective data harvesting: The fu^"si challenge may be about effective crawling of a live and representative distribution of data about organizations. Most current crawling methodologies rely on a fixed list of keywords (a few previously-known keywords) such as the name of the organization to crawl data. However such methodologies cannot cover all the relevant micro-posts and consequently topics about the organization.

[0069] To address this issue, according to various embodiments, the user community of the target organization may be automatically identified and monitored. The rationale of this approach may be based on the power law correlation between the number of users and the number of relevant tweets for organizations.

[00701 PIG- ³ shows an illustration 300 of the power law correlation between the number of users and the number of relevant tweets for three organizations, namely NUS (in plot 302), DBS (in plot 304), and StarHub (in plot 306). The statistics are obtained from 1-year tweets posted for NUS, and 6-month tweets posted for DBS and StarHub organizations.

[0071] FIG. 3 shows that a small number of users of an organization often produce the major portion of relevant content about the organization. To account for this, according to various embodiments, data may be crawled based on multiple aspects of organizations: (a) known accounts, (b) key-users, and (c) fixed keywords of the organization. The known accounts may be a few manually identified official accounts created on social media portals that broadcast news and announcements about the organization; while key-users are a dynamic list of active and influential users of the target organization that should be automatically identified (like will be described in more detail below). The above sources collectively elicit more relevant data for organizations as compared to the fixed keywords used by the current crawling methodologies. [0072] The second key challenge may be micro-post disambiguation: The second challenge is about discriminating relevant from irrelevant micro-posts with respect to the target organization as data streams in. This is a challenging task because of the polysemy problem in which the acronyms of organizations are often shared by many entities in social media. Current systems simply return many irrelevant micro-posts as they don't disambiguate micro-post for organizations that share the same acronym. It is to be noted that users often use the acronym forms instead of the complete names of the organizations in the social media context mainly due to the length limit imposed by social media portals.

[0073] To address this challenge, according to various embodiments, the context of the target organization defined by the current relevant content (keywords and micro-posts) and the user community of the organizations may be utilized. A highly accurate classifier may be provided to predict the relevance of each incoming micro -post to the target organization based its context information.

[0074] The third key challenge may be topic discovery and monitoring: The third challenge is about online clustering of relevant streaming data into coherent set of topics. This is challenging because, with streaming data, new topics as well as the old ones can be introduced or vanished respectively at any point of time.

[0075] To address this challenge, according to various embodiments, the stream of relevant micro-posts may be clustered into emerging and evolving topics. The emerging topics may be the new topics that emerge and potentially become major in a short period of time, while the evolving ones may be those that have been detected previously and are smoothly evolving through time. For the topic modeling purpose, a novel online sparse coding approach with temporal continuity and sparse matching constraints may be provided. The approach according to various embodiments may be linear with respect to the number of input micro-posts. Furthermore, according to various embodiments, a simple purging mechanism may be provided to detect the inactive topics to further improve the performance of topic modeling.

[0076] A dataset of tweets obtained from Twitter for three organizations was created to evaluate our approach. The three organizations are: National University of Singapore (NUS), Development Bank of Singapore (DBS), and StarHub company. The first two organizations are ambiguous (in which NUS is shared between National University of Singapore, National Union of Students, and NU Skin company, and DBS is shared between several organizations like Development Bank of Singapore and Dublin Business School etc), while the third organization (StarHub) is not ambiguous.

[0077] The evaluation results show that key-users are effective factors to elicit more relevant content about organizations from social media. We show that monitoring key- users leads to: (a) higher performance of topic modeling algorithms and (b) earlier detection of emerging topics. In addition, it can be shown that the proposed framework can discriminate relevant micro-posts with a high accuracy of 86.85% for the ambiguous organizations. Also, the performance of topic modeling further improves when topics are allowed to evolve under a temporal continuity constraint according to various embodiments,

[0078] According to various embodiments, a framework may be provided which effectively addresses the data harvesting problem for organizations. It may utilize multiple aspects of organizations to obtain more relevant data about them from social media.

[0079] According to various embodiments, a framework may be provided which effectively resolves the polysemy issue in social media for organizations.

[0080] According to various embodiments, a framework may be provided which provides a novel adaptation of online sparse coding algorithms to mine the emerging and evolving topics for organizations.

[0081] In the following, an overview of the framework according to various embodiments, for example for mining the sense of organizations in social media, will be given.

[0082] FIG. 4 shows an illustration 400 of an architecture according to various embodiments for mining the sense of organizations from social media. A fixed keyword crawler 402, a known account crawler 404, and org key-user crawler 406 may be provided, like will be described in more detail below.

[0083] Given a target organization, the framework according to various embodiments may utilize the several crawlers to obtain potentially relevant data about the organization from social media. The resultant data is given to a classifier 410 to make a real-time judgment about their relevance to the taiget organization. The classifier according to various embodiments may make use of the context of the organization (both content-level information 408 and user-level information 412) provided by the keyword miner 414 and user miner 418 (using a user graph 426 and a friend list crawler 428) components respectively. The relevant data may then be stored in the relevant tweet repository 416. The topic miner component 424 may extract the cunent emerging topics 422 and evolving topics 420 about the organization using the resultant relevant data. Each component will be described in more detail below, and the approach according to various embodiments for each component will be discussed.

[0084] In the following, information acquisition will be described.

[0085] In the following, the fixed keyword crawler 402 will be further described.

Brand monitoring systems may make use of a few manually selected fixed keywords to crawl data for organizations. Examples of fixed known keywords for a given organization are the name of the target organization or its products, the acronym of the organization etc. The fixed keyword crawler 402 may crawl the micro-posts that contain the fixed keywords.

[0086] In the following, the known account crawler 404 will be further described. Similar to fixed keywords, a few known accounts for the target organization (such as the Optus account in FIG. 2) may be manually identified. These may be official accounts of the target organization that act as informers and usually post relevant micro-posts about their organization. These accounts may be given to the known account crawler 404 to be observed.

[0087] In the following, the org key-user crawler 406 will be further described. The org (organization) key-user crawler 406 may be provided with a dynamic list of key-users to be observed. A definition for key-users according to various embodiments will be provided below.

[0088] In the following, the user friend crawler 428 (in other words: friend list crawler 428) will be further described. The user friend list crawler 428 may be used to construct the user graph 426 of the target organization by crawling the social relationships between users who have posted relevant data about the organization. This user graph 426 may evolve over time as new users are identified.

[0089] In the following, the keyword miner 414 will be further described. The keyword miner 414 may utilize an active learning approach to extract temporally-relevant keywords for organization from the recently seen relevant data. These keywords may be considered as dynamic keywords at each point of time and used by the classification component to determine the content-based relevance of the incoming micro-posts.

[0090] In the following, the user miner 418 will be further described. The user miner 418 may identify the user community and the key-users of the organization so that such users may be monitored in order to obtain more relevant data about the organization. The user miner 418 may utilize the user graph 426 and user activity information to rank the users and find key-users of the organization (like will be described in more detail below).

[0091] In the following, the classifier 410 will be further described. The input data obtained by different crawlers may be a mix of relevant and irrelevant data. For example, key-users may also send micro-posts about other subjects like their various life activities. The classification component 410 (in other words: the classifier 410) may utilize the context information to label the input data as relevant or irrelevant to the target organizations.

[0092] In the following, the topic miner 424 will be described. The topic miner component 424 may utilize the relevant tweets to detect and keep track of topics related to the target organization. According to various embodiments, an adaptation of online sparse coding algorithms may be provided to learn the topics in an efficient way. [0093] In the following, mining keywords and organization users will be described and the approach according to various embodiments for mining organization context defined by its content and user community will be described.

[00941 In the following, mining dynamic keywords will be described.

[0095] Dynamic keywords may be those keywords that represent the current discussions about the target organization at each point of time. To identify such keywords, suppose we have two sets of foreground {s}„) and background (s£_aA) tweets at each point of time t. Let sj_or include the recently- seen relevant tweets posted in a short time window of length T, i.e. [t-T,t], while ¾_tt([ includes the irrelevant tweets identified in the same time window, [t-T,t]. In addition, let

be the vocabulary set obtained from Sf_0r. We define the dynamic keywords as a subset of W* words that best represent the current relevant discussions about the organization. Our aim is thus to extract such keywords from W¹.

[0096] For this purpose, we identify the terms of W that have different distributions in S _ov and -¾_afe. A significant difference between the two distributions of a term Wj ^.W¹ in 5^_orand S£_ek signals that ; better represents one of these sets. Those significant terms that have rising frequency in S _or can potentially represent the dynamic keywords.

[0097] Here we utilize the chi-squared test to compare distributions as its calculation is fast and suitable for rapidly evolving social media content. To derive this value for each j W¹, we use the following Equation:

where fi and bj are the normalized term frequency values of Wj in the foreground and background sets respectively and are computed as follows:

,,b k

¾ = ¹⁰⁰ * ¾3F (3)

Vi

where wf^or and w ^afcare the tenn frequency of Wi in Sj_or and irrespectively. Equation (1) assigns higher weights to the terms that frequently occur in S _or, but rarely occur in Sf,_ak. Thus, Equation (1) only takes into account the words Wj with^f > b_/ and assign zero weight to those with f_t <bi. We rank the terms based on their values and consider those with χ² value greater than e (where e = 2,706 which corresponds to p=0.10 significant level of t-test) as the dynamic keywords.

[Θ098] In the following, mining organization users according to various embodiments will be described.

[0099] The framework according to various embodiments may rank the more active and influential users of the organizations in the higher orders, while, in case of ambiguous organizations, discard the users of the other organizations. We define an active user of an organization as the one who sends many relevant micro-posts about the organization, and an influential user as the one who has many followers within the organization and initiates major discussions about the organization. The combination of these measures can be used to rank the users of the target organization with high accuracy. Let G^l be the user graph of the target organization at time t (compare FIG. 4) and U^l={ui .., u_m} be the set of nodes in G^l. We compute the score for each user u,- E U' based on the following Equation at time t:

where «₍ is the total number of relevant tweets posted by u, up to time t, /£, is used in case of ambiguous organizations and indicates the total number of irrelevant tweets that contain the acronym of the target organization posted by u; up to time t (in case of non- ambiguous organization j£_f is 0). The variable is the total number of Uj's followers who exist in U^l, t/„ is the total number of Uj's relevant tweets that have been re-tweeted by other users up to time t, signf.) is the sign function, and τ, φ , and ω are weighting parameters such that x + φ + ω— 1 . For example, these parameters may he empirically set as follows for experiments: r— 0.5, φ = 0.25, ω = 0.25.

[00100] The above equation may rank the user based on the aforementioned three criteria. The top K users may be considered as the key-users of the organization at time t. These users may be passed to the org key-user crawler 406 to be monitored.

[00101] In the following, relevant tweet detection will be described, and it will be elaborated on the classification approach according to various embodiments to address the polysemy problem for ambiguous organizations.

[00102] According to various embodiments, a high quality classifier may be provided to discriminate relevant content for organizations by (a) learning their content relevance and (b) their user information respectively.

[00103] In the following, a learning content-based classifier will be described.

[00104] According to various embodiments, the framework may assign a relevance score to each input data based on its content similarity with the current discussions about the organization. For this purpose, we utilize the dynamic keywords (mined as described further below) because such keywords are good indicators of the current discussions about the organization.

[00105] Formally, let W - fwt w_m} of arbitrary size m contains the dynamic keywords at time t. Also, as before, let S _or be the set of recently-seen relevant tweets over the time window [t-T,t] and Sj_or be the irrelevant tweets in the same time-span [t-T,t] where t is the current time. We utilize W' as the classification features and Sj_orX) S^_ak as training data to discriminate the input streaming data into relevant and irrelevant sets. The dynamic keywords may provide a fast way to prune the huge amount of irrelevant input data as they stream in.

[00106] We take a binary weighting schema to weight the features for each input tweet. That is, given a tweet, we create its m-dimensional feature vector using W¹ as follows: the L"¹ entry of the feature vector is set to 1, if the tweet contains Wj, and 0 otherwise. Any input data with a zero feature vector may be regarded as irrelevant by default. At the end, each test tweet may be assigned a relevance score which represents the content-based relevance score of the tweet.

[00107] As the classification approach, experiments may be done with SV (support vector machine) classifier which is an effective classifier on textual data. As the baseline, Unigram and Bigram features obtained the combination of Sf_0rand _¾_aft tweets may be considered.

[00108] In the following, combining content and user information according to various embodiments will be described.

[00109] Given the context (i.e. content and user information), a final judgment may be made about the relevance of an input tweet to the target organization. The user information for the data obtained from the fixed keyword crawler may be utilized. This may be because the data crawled from the other two crawlers (known account crawler 404 and org key-user crawlers 406) may come from the users who already have high relevance scores to the target organization and therefore it may be desired to ensure the relevance of their content.

[00110] Formally, given a test tweet sf at time t obtained from the fixed keyword crawler, the final score of the tweet may be determined by the linear combination of its content and user score as follows:

Li = a * C_ei + (1 - ft) * (5) whereas for a tweet obtained from the other crawlers we determine its final score by solely considering its content relevance score as follows:

i = O_sl (6) where C_S( 6 [—1,1] may indicate the content-based relevance score of sf and W¾j G [—1,1] may indicate the relevance score of Uj as the author of if (see Equation (4)). The parameter may control the contribution of each of the above scores in labeling the tweet. This parameter may be learnt using development data.

100111] Any incoming tweet with Lj>0 may be considered as relevant, and the rest as irrelevant. The relevant tweet may be added to the relevant tweet repository which will be then utilized in the next iterations. Table 1 illustrates an online classification algorithm according to various embodiments, The effect of the length of the time interval t on the classification performance may be analyzed.

[00112] Table 1 shows an illustration of Algorithm 1 and Classification at time t. Input: Q*: input test data,

T: learning time interval,

a: learning parameter.

Output: L: classification result.

1. le rn SVM classifier with labeled data seen in [t

2. for each e Q* do

3. use the classifier to compute C_Sl

4. if sj contains a fixed keyword

5. compute Wf,

6. Li = a * C_ti + {l - ) * W*₍

7. else

8. Li = C_Si

9. end for

Table 1.

[00113] In the following, an optimization framework according to various embodiments designed for mining organization topics (for example mining evolving and emerging topics) will be desciibed.

[00114] In the following, streaming input data according to various embodiments will be described,

[00115] Assume that, at each point of time t, we receive a set of relevant tweets

e IT^1*" where n^l is the number of relevant tweets at time t and m is the size of vocabulary. We represent each Sj G t¾^m as a term vector of length m weighted by the standard Term Frequency (TF) and Inverted Document Frequency (IDF) as follows:

where C is the normalization factor, TF(i,j) indicates the frequency of the term and IDF(j) indicates the inverted document frequency of Wj. [00116] In the following, live topic learning will be described. [00117] At each point of time, the incoming tweets may be either matched with the already known topics or may potentially represent new emerging topics for the organization.

[00118] Let the non-negative matrix D e M"^1*^^"1 represent the k'^"1 topics found up to time t-1 for the target organization and S^l e R"^1*" indicate the relevant incoming tweets at time t, Given D^1"' and S^l, the problem is to determine the topic matrix at time t, i.e. D^l 6 K"¹* ' . This matrix may include the smooth evolution of the k^1"1 previously known topics (evolving topics) as well as the new topics identified at time t (emerging topics).

[00119] Let S^ev E R^m*nev indicate the tweets of S¹ that can be matched, to a certain level of significance, with a topic in D^1'1, and S^em e E^m*nembe the rest of tweets (these tweets can potentially form the emerging topics) where n -n^ev+n^eni. It will be explained how to decompose S⁴ into these two matrices further below.

[00120] FIG. 5 shows an illustration 500 of learning evolving and emerging topics at time t; wherein the circles represent the topic learning (TL) process. In 502, the enor difference between the topic of the tweet s, and the topics that have been learned up to time t-1 , i.e. D^{1' 1} may be computed. The error difference may be called residual error. In 504, a purging method may be performed which removes a topic from topic set D'^"1 if it is not matched with any tweet Si for 24 hours. In 506, 508, 520 (which may be all together provided as one component), topic modeling may be performed over the evolving tweets S^cv to create D^ev the evolving topics at time t. In 514, 516 (which may be all together provided as one component), topic modeling may be performed over the emerging tweets S^em to create D^em the emerging topics at time t In 510, the results of the above two topic modelers, i.e. D^ev and D^em, may be concatenated to foim the final set of topics at time t, i.e. D\ In 512, the emerging tweets may be clustered into groups to form D^em (this component may be optional or may be removed as (514, 516) may do this, and as such, this component may be there merely for quality purposes). It will be understood that the diagram of FIG. 5 shows what is happening in one time-step (for example the transition from time t-1 to t), and the time is indicated by time axis 518.

[00121] As depicted in FIG. 5, given S^cv, S^em, and D^1"1, we need to solve the following two sub-problems to obtain D^l: (a) how to learn the evolving topics using S^ev and D^1"1 (we indicate the evolving topics by D^ev G iR^T"*^{ft£ 1})_j and (b) how to learn the new emerging topics using S^em (we indicate the emerging topics by D^cm e W^m*^k' ). The topic matrix D'e ^m*kl where k^k'-'+k' may then be achieved by vertical concatenation of D^c and D^cm.

[00122] The following two constraints may be considered when learning the topic matrix D^l:

[00123] - Temporal Continuity constraint: This requirement may constrain D^ov to be a smooth evolution of D^1"1 , and

[00124] - Sparse Matching constraint: This constraint may indicate that each tweet S; can only represent a "few" topics.

[00125] The first constraint may be to prevent dramatic changes in the evolving topics in two consecutive time stamps, whereas the second constraint may be due to the limited length of the tweets. This may be because tweets are limited to 140 characters; this space may be too short to be used for writing about several topics. [00126] Based on the above requirements, the evolving topic matrix D^ev may be learned by minimizing the following optimization problem

(D-"_S X-) _{= ar}g _mi]lD__x (I s-- - DX [ft +μ II D - D^t_1 ||J. +λ || X || ,

(o) s.t. : X > 0, D > 0, II d_j |||< 1 V? e {l.-.k'^"1} where X^ev 6 R^{kt~l* ev} may be the weight matrix and A 6 [0,1] and μ £ [0,1] may be the learning parameters. The first term in the above Equation may be the reconstruction error, while the second and the third terms may represent the two aforementioned constraints respectively. The above topic learning process may optimize the matrix D^ev with respect to D^1"1 and S^ev. It is to be noted that, in this step, no new topic is introduced.

[00127] In contrast to the evolving topics, the emerging topics may be totally new and there may be no prior information about the number of emerging topics. Therefore, the X- eans approach may be utilized to find an initial set of clusters from S^em. X-Means may be an extension of the standard k-means that utilizes the Bayesian Information Criterion (BIC) model to estimate the best number of clusters within a given range. The resultant clusters that have sufficient number of tweets (for example, only the clusters that have more than 20 tweets may be taken into account) may be considered as the emerging topics and their centroid vector may be used to create an initial emerging topic matrix D^,ni1 6 R^m*k' where k' is the number of such centroids. Then, the same approach may be followed as evolving topics to find the optimum value for D^em as follows:

(D^«"\ X*'») = nTg min_DlX || S«"< - DX \\ +μ || D - D'^»« +A || X ]|,

(9) s.t. : X > 0, D > 0, II d_j \\^'i < 1 Vj€ {1-1·'} where X^eme ^ltnem may be the weight matrix. FIG. 5 depicts the overall procedure of learning topics at each point of time. It is to be noted that the above two processes (learning D^ev and D^em) can be performed in parallel to speed up the overall learning process. The purging process in FIG. 5 will be explained further below. [00128] In the following, decomposition of streaming data will be described, [00129] Given the input matrix S^l and the topic matrix D^1'1, it may be desired to decompose S^l into S^ev and S^em matrices. For this, we find the best representation of each Sj 6 S* in terms of D^t_l as follows:

: argmirij || - Ό⁽~^ιχ- ||? +λ

(10) > 0

100130] The resultant vector j e E¾^¾t 'ma indicate the already known topics that best represent the input vector s_\. Using this vector, we compute the representation error of s* on D^1"1 (what we call residual error) as follows:

D*^{" 1} ) =|| _Si - D'-'a-i |j| +A || x_; ||i (11)

[00131] Based on the value of the residual error, the matrix S^l may be decomposed into the two matrices as follows:

[00132] - S^cv: contains all s; e S' with a residual error equal to or smaller than a chosen threshold r/, and

[00133] - S^cm: includes other inputs, i.e. all Sj e S^l with a residual error greater than η. [00134] In the following, purging trivial topics will be described.

[00135] As time passes, some topics may become old and no more discussions arrive about them. Such topics may be safely removed from the topic matrix D^l. There may be different approaches to accomplish this. For example one may directly remove the non- active topics or replace them with a randomly selected input data. According to various embodiments, the first approach may be applied as it better suits the need for keeping the size of the learned topics manageable.

[00136] To do so, the most recent time that each topic is selected as the dominant topic for an input tweet may be stored. This time may be used as a measure to purge the topics, The dominant topic of each Sj may be the topic that has the greatest matching score with Si as compared to all the other topics, i.e. the topic d_j such that j=arg maxj Xjj where is obtained from Equation (10). It is to be noted that the matching score between each d_j and each Si may be deteimined by the (ij)lh entry of the weight matrix X, i.e. xjj, see Equations (8) and (9). According to various embodiments, all the topics that have not been selected as a dominant topic in the past 24 hours may be considered as non-active and are removed from D^l.

[00137] In the following, optimization algorithms and solutions to the resultant optimization problems will be described. A fast online approach to solve the optimization problem of Equation (8) will be described, and the same approach may be used to solve Equation (9). This optimization problem may be in general non-convex, but, it may be shown that, if one of the variables, either D or X is known, optimization with respect to the other variable will be convex. Therefore, a general solution may be to iteratively optimize the objective function by altematingly optimizing with respect to D and X while holding the other fixed.

[00138] If D is fixed, i.e. D=D^t'1, then the problem may be equivalent to a -ij - regularized least square problem and can be efficiently solved by a least angle regression (LARS) method or an alternating direction method. However, when X is fixed, the problem may be a least square problem with quadratic constraints. To solve this problem, an advanced version of the projected gradient approach may be provided. It may be an effective online approach that processes each input data (or a small subset of data) only once. This may be particularly important in the context of social media where the input data can potentially be large at each time.

[00139] If D is fixed, then Equation (8) may be converted to the following problem (for simplicity in notation and exposition, it may be assumed that D=D^ev, S=S^ev _} and X=X^ev):

X = arg mmx |[ S - D'^X ||* +λ |] X \ s.t.: X≥ 0.

[00140] The above Equation may find the optimal value of X and may be solved by a least angle regression (LARS) method. It is to be noted that as x;s are independent, they may be optimized in parallel. However, if X is fixed, i.e. obtained from the above Equation, then Equation(8) may be converted to the following problem:

D = avgniiuD || S - DX f_F +X |j X ||_: +μ || D - D'^"1 ||

(13) s.t. D 0, \\ dj ||.?< l Vj e {I...*'^"1 }

(00141] Given S, X, and D*^'1, let a loss function £(D) be defined as follows:

£(D) =|| S— DX \\² _F +λ II X II, +μ || D - D^1"1 _F (14)

[00142] The projected gradient approach may solve Equation (13) by iteratively obtaining the projected gradients using the following updating rule:

Di+i = P D, - _0iV_D£(D) _{lDj X)} (15) where Dj may indicate D at iteration i, the parameter c¾ may be the step size, and D£(D)f_{Di X}j may be the gradient of C(D) with respect to D₍ see Equation (16), evaluated on D; and X, and may be a projection function defined for the non-negativity constraint, Equation (17):

V_D£(D) = 2SX^R + DXX^R + 2/i(D - D'^{_ 1}) _C16) pr _! _ / ^z f ^ 0

^{1 [Z}> - 0 otherwise <17)

[00143] The disadvantage of the above approach may be that it may be slow and may need the parameter a to be carefully chosen to obtain good results. To resolve these issues, the second order information, the Hessian matrix, may be used to make the updating rule in Equation (15) parameter free with faster convergence. The Hessian matrix may be utilized to obtain the final updating rule as follows:

D_I+1 = [D,- - V_D£(D) _{(DI X]}H^ [£(D)] _(X|] _A8) where the Hessian matrix of £(£>) may be defined as follows:

ΚΜΌ)} = ΧΧ^τ + 2μΙ, _U9) and -¾^"-1[£(D)^_]] may be the inverse of the Hessian matrix evaluated on X. Since the exact calculation of the inverse of the Hessian matrix may be time-consuming for large number topics, the Hessian matrix may be approximated by its diagonal line based on a diagonal approximation method. Table 2 summarizes the detail procedure of computing D¹ and X' given S¹ and D^1"1 according to various embodiments.

Table 2: Algorithm 2, computing D* and X¹ at time t, see TL in Figure 5.

[00144] In the following, the evaluation methodology according to various embodiments will be described.

[00145] It is assessed how the approach according to various embodiments can make a real-time judgment on the relevant keywords, micro-posts, and topics about a given organizations. The framework according to various embodiments is evaluated from two perspectives: performance in (a) identifying the relevant data and (b) modeling the topics about the organization as data streams in.

[00146] In the following, an evaluation metrics for classification will be described.

[00147] The performance of the classifier according to various embodiments for the positive (relevant) and negative (irrelevant) classes may be evaluated based on Precision, Recall and Fl-score metrics:

Precision- = JtectrfT = Fl^" =

NiofceiErf- Nfpitii - [Preci sion- ) + {Reca )

^,_ _{1 =} m± _{20) where N_correct+ is the number of micro-posts that were assigned correct relevant label, N 'labeled* ^{15 trie} number of micro-posts that were labeled as relevant, and / _tota[+is the total number of relevant micro-posts (the same definition applies for the irrelevant class). Fl⁺ and Fl^" arc the classification performances for the relevant and irrelevant classes respectively and therefore Avg-Fl indicates the average classification performance in terms of Fl -score.

[001481 In the following, evaluation metrics for topic learning will be described.

[00149] Two evaluation metrics may be considered to assess the performance of the topic miner component, namely topic detection accuracy, and miss-rate at first detection. The first measure evaluates the topic detection performance in terms of precision and recall, while the second measure evaluates the amount of information (number of tweets) that has been missed before the first automatic detection of each topic. The second measure is important as we need a small miss-rate for earlier prediction of emerging topics.

[00150] Assume that the set 1= {Ii , I_n{ is our topic ground-truth where each ¾ represents a topic and ø( ,·) indicates the set of tweets that are related to the topic ¾. Furthermore, let o_tj and l_tj be the time that the first and last tweet of ¾ were posted respectively (we call these times the origin and the last time for I_j respectively, thus, [0_]j,l]j] shows the life time of Ij). Let dj G D¹ be a topic that was detected at time t such that Oi ≤ t≤ I_jj, We define the closeness between dj and ¾ as follows:

Precision

Rccall^ij

2Precision*³ Recafi

Fl'³ (21) Precision¹^ + Recall where [.| indicates the cardinality of the corresponding set (number of tweets). The value of F1^IJ shows the similarity between the two topics, i.e. F1^1J=1, iff (i.e. if and only if) the two topics contain exactly the same set of tweets, and F1^IJ=0, iff they are disjoint. Topic dj that produces the maximum value of F1^,J for I_j is considered as the most similar topic to I_j (i.e. the best match). The overall performance of topic detection for the topic set I may then be determined as follows:

[00151] As for the second evaluation measure, miss-rate at first detection, the fraction of ø(/,^■) tweets posted before the origin time of dj (that is the best match of Ij) may be considered as the missed tweets and their percentage determines the value of miss rate (MR) for I_j. Formally, the miss rate for Ij is determined with respect to dj and may be defined as follows:

„„..- Is : s e (Ιί ) & ti esiamp(s) < bA

= ^ ^■ (23) where bi is the origin time of dj. The overall miss-rate for the topic set I may then be obtained as follows:

MR = =i=i (24) n

[00152] A good topic miner should have a high topic detection performance, Fl, and a small miss rate, MR.

[00153] In the following, the experimental settings and results will be described. In the following, the terms micro-post, tweet, and data may be used interchangeably. [00154] In the following, data and settings will be described. Three organizations are considered in this study. They are: National University of Singapore (NUS), Development Bank of Singapore (DBS), and StarHub company (StarHub). NUS and DBS are ambiguous organizations (DBS shares its acronym with many organizations like Dublin Business School, Doha British School, and concepts like Defensive Backs etc), while the third organization (StarHub) is not ambiguous.

[00155J The crawlers according to various embodiments utilized the streaming API of twitter to crawl data. Around 10 fixed keywords were manually identified for each of the ambiguous organizations NUS and DBS (including their acronyms) and only one fixed keyword, the term "starhub" itself, for StarHub. Furthermore, the known accounts for each organization (around 5 to 30 accounts for each organization) were manually identified. Only the top K (k=200) key-users of each organization (see above) may be consider and the key-user list may be updated on a daily basis according to Equation (4). Table 3 shows the number of tweets obtained from each of the crawlers and the crawling eriod for the three organizations.

Table 3: Data statistics and crawling period for the three organizations NUS, DBS, and StarHub.

[00156] In the following, ground truth and settings for classification will be described.

[00157] We created a ground-truth of tweets as relevant or irrelevant for each of the three organizations. For this purpose, we considered all the tweets crawled in a time- window of 10 continuous days for each organization and employed a semi-automatic approach to label them as relevant or irrelevant to the target organization.

[00158] To ease the annotation task, we first extracted all the hashtags from the tweet set of each organization. We manually labeled these hashtags as relevant or irrelevant to the target organizations. We then constructed a set of labeled tweets using: (a) all the tweets that contained at least one of the labeled hashtags and (b) all the tweets posted by known accounts of the organizations that contained at least one fixed keyword. We learned an SVM classifier using this training set and utilized it to label the rest of the tweets crawled in the time-window of 10 continuous days. We utilized term Unigrams and user profile information such as user's location and timezone as classification features. In case of low confidence in the classification results, we judged the tweets based on manual annotation. Overall, we obtained 2.5K, 1.5 , and 4.5K relevant tweets foi- NUS, DBS, and StarHub respectively.

[00159] FIG. 6 shows an illustration 600 of the distribution of the relevant tweets in the resultant ground-truth for the three organizations (NUS, DBS, StarHub, in subplots 602, 604, and 606). "Fixed-Known" indicates the number of relevant tweets obtained by the fixed keyword or known account crawlers for the organization, while "overall" indicates the total number of relevant tweets obtained by all the three crawlers. As it is clear, there are many relevant tweets crawled by the key-user crawler. Such tweets can greatly improve the performance of online topic miner algorithms by providing more content information about the topics. We should also note that there is a high overlap between the data obtained by the fixed keyword and known account crawlers. This is to be expected as the tweets posted by the known accounts of organizations are mainly official news about the organization and usually contain the fixed keywords.

[00160] For parameter setting, we use the first three days of the ground-truth as development data to leam the time interval t and linear combination parameter α in Equation (5). We then employed the resultant values to evaluate the classification performance on the remaining seven days.

[00161] In the following, ground truth and settings for topic modeling will be described.

[00162] Similar to the above approach, we conducted a semi-automatic method to construct our topic dataset. For this purpose, we manually identified 45 topics for the three organizations (15 for each organization). For each topic, we identified the hashtags and all the keywords and key-phrases that uniquely identify the topic. Then, for each topic, we found the tweets that are posted within the topic life time and contain at least one topical keyword or key-phrase. We treat these tweets as the relevant tweets to that topic. Table 4 shows the examples of such topics. Our topic dataset covers different events about the organizations and range From small topics of around 50 tweets per topic to topics with more than 1000 tweets.

Table 4. Some sample topics/events and their corresponding organizations. [00163] For parameter setting, we used the first five topics of each organization as development data to tune the temporal continuity parameter μ in Equations (8) and (9) for each organization. We then utilized the resultant μ values to perform the evaluation on the rest of the topics for the target organizations. We also study the effect of this parameter on the performance of our approach. In addition, for parameter setting, we set X |h_at j_{s a} classical normalization factor, in all the experiments where μ is the

number of terms. We also empirically set the threshold for residual error to η = 0.3.

[00164] In the following, experimental results of tweet classification will be described.

[00165] To simulate live data streaming, we ran our online model over one month data (the month that includes the ground truth data) for each organization, while we restricted the evaluation to the tweets in our ground truth dataset.

[00166] As mentioned above, we used the first three days of the ground-truth to learn t and a and employed the obtained values to evaluate the classification performance on the remaining seven days. Table 5 presents the classification performance as average Fl score as described above. The fixed-kw, Unigram, and Bigram rows show the classification performance when we used fixed keywords, term Unigrms, and term Bigrams as classification features respectively (we consider them as baselines). Dynamic- kw shows the classification performance when we used dynamic keywords as the classification features, i.e. the results obtained from Equation (6), while Dynamic-kw + User represents the performance when we used dynamic keywords in conjunction with user information, as shown in Equation (5). Note that, in all the settings, if an input tweet did not contain any classification feature, we treated it as irrelevant. In addition, in all the experiments the two-tailed paired t-test with p <0.01 was used for significance testing. We use the asterisk mark (*) to indicate significant improvement over the best performing baseline.

100167] Table 5 shows an illustration of a Classification performance in terms of Avg- Fl with different types of features and input Classification perfonnance in terms of Avg- Fl with different types of features and input data.

Table 5: Classification performance in terms of Avg-Fl with different types of features and input Classification performance in terms of Avg-Fl with different types of features and input data.

[00168] The results show that the dynamic keywords alone significantly improve the classification perfonnance over the best perfonning baseline. This indicates that our keyword mining algorithm is able to effectively discriminate current relevant keywords from inelevant ones for each organization. We should also note that since the dynamic keyword model has fewer number of features (as compared to the total number of terms or Unigrams), the classification can be performed very fast which is desirable in online settings. In addition, adding user information significantly improves the classification performance. This is because we utilize the entire user activity with respect to each organization, see Equation (4), to judge its input data.

[00169] It is to be noted that the value of is smaller than 0.5 for both NUS and DBS. This was expected because the parameter only affects tweets with fixed keywords (see Algorithm 1) and for such tweets the weight of the user score, i.e. 1-a, is expected to be high. Also, the classification performance is invariant to the parameter a in case of StarHub: as we mentioned above, the parameter only affects tweets with fixed keywords. Such tweets are considered as relevant for non-ambiguous organizations by default (see subplot 604 in FIG. 6),

[00170] FIG. 7 shows an illustration 700 of the effect of learning parameters T and a of the classification performance for NUS.

[00171] FIG. 8 shows an illustration 800 of the effect of learning parameters T and of the classification performance for DBS.

100172] FIG. 9 shows an illustration 900 of the effect of learning parameters T and of the classification performance for StarHub.

[00173] FIG. 7, FIG. 8, and FIG. 9 (in subplots 702, 704, 802, 804, 902, and 904) show the effect of the learning parameters t and a on our model, Dynamic-kw + User, evaluated over the entire ground truth dataset. In each case, we fixed one of the parameters and investigated the effect of the other one. For the fixed parameter, we used the value obtained from the development set. We restricted the interval time to 7 hours, i.e. 1 -3≤}, and the parameter to 0.1 a<\ with learning steps set to 1 hour and 0.1 for t and respectively,

[00174] As FIG. 7, FIG. 8, and FIG. 9 show, greater time intervals (t) increase the classification performance for NUS but causes great reduction in the classification performance for DBS and StarHub. The life time of the topics happening about the organization may affect the classification performance. If the topics are long, increasing the time interval t may not reduce the performance as the old topics are still active, whereas for short topics, increasing t reduces the performance as the old discussions are not active anymore and thus the dynamic keywords extracted from such topics are not useful features to classify the current input data. As FIG. 8, and FIG. 9 show, for NUS and DBS as ambiguous organizations, smaller values of a (i.e. giving less weight to the content relevance score and higher weight to the user score for the tweets with fixed keywords) leads to better performance. This result indicates the important role of user scores to classify tweets with fixed keywords. As mentioned above, the classification performance for non-ambiguous organizations like StarHub is invariant to the parameter a, but will be affected by the learning time interval.

[00175] In the following, experimental results of topic modeling will be described.

[00176] For these experiments, the online topic modeling method according to various embodiments may be applied over the entire dataset for each organization and only restrict the evaluation to the topic dataset.

[00177] We first tune the learning parameter μ (see Equation (8)) for each organization using the development dataset as described above. We then employ the resultant μ to evaluate the topic modeling performances of different approaches on the test topics. In these experiments, we consider the basic Non-Negative Matrix Factorization (NMF) algorithm without the temporal continuity and sparse matching constraints as the baseline (i.e. for the baseline, we set A = 0 and μ = 0 in our optimization framework to obtain the baseline performance).

[00178] Table 6 shows the evaluation results for topic detection in terms of Fl performance (Equation (22)). The Overall column shows the performance when we perform the evaluation over all the relevant input data for the topic modeling puipose, while the Known column shows the corresponding performance when we only use the relevant tweets obtained from fixed keyword or known account crawlers. Baseline (N P) Optimization Framework

Organization known overall known overall

NUS 39.30 40,03* 37.56, μ : 0.3 39.82, /ί:0.3

DBS 64.67 61.42 65.Θ3, ;/ : 0.3 80.64*. μ:0Α

StarHub 42.88 46,84 43.07, μ : 0.3 51.78*. μ:0.Β

Average 48.95 49.43 48.75 S7.41

Table 6: Topic detection Fl performance with known and overall input, the higher values show better performances.

[00179] In almost all the case (except DBS baseline), the overall data results in a higher performance as compared to known data. This improvement was indeed expected as there are many relevant tweets that are not covered by the fixed keywords and known accounts of organizations. The average performance of baseline and our optimization framework increases from 48.95% to 49.43% and 48.75% to 57.41% respectively by utilizing these relevant tweets.

[00180] The optimization framework according to various embodiments outperforms the baseline for DBS and StarHub while its performance for NUS is comparable with the baseline. The average improvement over the baseline is 7.98%, i.e. from 49.43% to 57.41%o, when we utilize the overall input data for topic modeling.

[00181] It is to be noted that the lower Fl performance for NUS and StarHub (as compared to DBS) could be due to the longer life time of NUS' and StarHub's topics as compared to DBS's topics in our dataset. The long topics may reduce the topic modeling performance as such topics may be divided into sub-topics by different algorithms (mainly because of shifts in topics through time). This is while we only have one best match for each topic. [00182] Comparing the average performances of the baseline and the optimization framework according to various embodiments, it may be concluded that topic detection and tracking is more effective if we use the sparsity and temporal continuity constraints for topic mining. In fact, the temporal continuity constraint helps the system to utilize the past information about topics to make a better judgment about the current topics. FIG. 10 shows the effect of this constraint on the overall topic modeling performance.

[00183] FIG. 10 shows an illustration 1000 (including subplots 1002, 1004, and 1006) of an effect of the temporal continuity constraint on the performance of topic modeling for three organizations. The experiments were performed with μ = 0; 0:3; 0:6; 0:9.

[00184] Table 7 shows the evaluation results for the miss-rate at first detection metric (Equation (24)). The lower values of miss-rate indicate that the topic modeling algorithm is able to identify the emerging topics earlier. As the Figure shows, the average miss-rate is lower when we use the overall data instead of only tweets obtained by the fixed keyword or known account crawlers. This suggests that we can detect emerging topic earlier, if we make use of more (relevant) tweets. It may be concluded that the key-user crawler according to various embodiments is an effective resource for early prediction of emerging topics about organizations. The results show that our approach outperforms the baseline by 4.83% reduction in the average miss-rate (from 27.1 1% to 22.28%).

Table 7: Miss rate results for fixed keywords and overall input, the lower values show better performances. [00185] According to various embodiments, given an organization, a framework may be provided to automatically identify relevant micro-posts, topics and users of the organization from social media.

[00186] Previous brand monitoring systems are not designed to address the lack of representative data or polysemy issues for entities like organizations. Various embodiments provide an effective framework to elicit representative amount of data about organizations and dealing with the polysemy issue for organizations in social media.

[00187] In fact, previous systems use a fixed set of known keywords to crawl micro- posts about a given organization from social media. This popular strategy results in (a) many missing relevant micro-posts, and (b) many irrelevant micro-posts. The first issue is due to the dynamic nature of the social media contents, while the latter issue is due to the polysemy problem in which the acronyms of organizations are often shared by many entities (for example, NUS is shared between National University of Singapore, National Union of Students and Nu-Skin™ company).

[00188] The framework according to various embodiments may provide methods to address the above issues by

[00189] - utilizing multiple aspects of organizations to crawl more relevant data. This includes crawling data through fixed keywords, known accounts and automatically identified key-users of the organizations, and

[00190] employing both content and user information to address the polysemy problem for organizations.

[00191] The above approaches lead to effective relevant data harvesting for organizations from social media. This in turn results in (a) higher performance of topic modeling algorithms, and (b) earlier detection of emerging topics for organizations. Also, given the automatically identified relevant micro-posts of an organization, the framework according to various embodiments provide a method to efficiently learn the topics of the organization through time.

[00192] User-centric organizations and businesses are spending substantial resources to identify their customers (user community) and hear their voice (current discussions and topics). The system according to various embodiments may provide live feedback to organizations by automatic discovery of the relevant content about them from social media, identifying their user community in social media, and listening to their key-users. This information may be invaluable for user-centric organizations as they utilize such information to obtain actionable insights from social media.

[00193] The system according to various embodiments may be fed by the social media portals like Twitter and Facebook.

[00194] The following description pertains to further embodiments.

[00195] According to various embodiments, an mformation determination device may be provided. The information determination device may include: an account crawler configured to determine data from at least one pre-determined user account; and an infonnation detenniner configured to determine information related to a pre-determined organization based on the data.

[00196] In other words, according to various embodiments, an infonnation determination device may be provided which determined information related to a predetermined organization based on data which are determined from one or more predetermined user accounts. [00197] According to various embodiments, the account crawler may include or may be a known account crawler configured to detennine the data from at least one account for the organization.

[00198] According to various embodiments, the account crawler may include or may be a key-user crawler configured to detennine the data from at least one account of a key user.

[00199] According to various embodiments, the information determination device may further include a keyword crawler configured to determine further data based on at least one pre-determined keyword, and the information determiner may further be configured to determine the information further based on the further data.

[00200] According to various embodiments, the keyword crawler may include or may be a fixed keyword crawler configured to determine the further data based on at least one fixed keyword,

[00201] According to various embodiments, the keyword crawler may include or may be a dynamic keyword crawler configured to determine the further data based on at least one dynamic keyword, and at least one dynamic keyword may be changed based on processing of the information determination device.

[00202] According to various embodiments, the information determination device may further include a user friend list crawler configured to determine a user graph of users in a social relationship with at least one of the organization or each other.

[00203] According to various embodiments, the account crawler may further be configured to determine the at least one pre-determined user account based on the user graph. [00204] According to various embodiments, the information determination device may further include: a classifier configured to classify the data into data relevant to the pie- determined organization and data irrelevant to the pre-determioed organization.

[00205] According to various embodiments, the information determiner may further be configured to determine the information based on the data relevant to the pre-determined organization.

[00206] According to various embodiments, the information determination device may further include a topic miner configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about the pre-determined organization.

[00207] According to various embodiments, the infoimation determination device may further include: a topic miner configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about pre-detennined organization, and the dynamic keyword crawler may further be configured to deteimine the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.

[00208] According to various embodiments, the account crawler may further be configured to determine the at least one pre-determined user account based on the data relevant to the pre-determined organization.

100209] According to various embodiments, the classifier may be configured to classify the data based on learning.

[00210] According to various embodiments, the classifier may be configured to classify the data based on a support vector machine. [00211] According to various embodiments, the topic miner may be configured to detect whether the determined information related to the pre- determined organization is related to the evolving topic or to the emerging topic based on learning.

[00212] According to various embodiments, the topic miner may be configured to detect whether the determined information related to the pre-detennined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem.

[00213] According to various embodiments, the optimization problem may include a temporal continuity constraint.

[00214] According to various embodiments, the optimization problem may include a sparse matching constraint,

[00215] According to various embodiments, the infoimation determination device may further include an optimization problem solver configured to solve the optimization problem based on a least angle regression.

[00216] According to various embodiments, the information determination device may further include a trivial topic purging circuit configured to remove old topics from evolving topics and emerging topics.

[00217] According to various embodiments, an information determination method may be provided. The information determination method may include: determining data from at least one pre-determined user account; and determining information related to a pre-determined organization based on the data.

[00218] According to various embodiments, the information determination method may further include determining the data from at least one account for the organization. [00219] According to various embodiments, the information determination method may further include determining the data from at least one account of a key user.

[00220] According to various embodiments, the information determination method may further include determining further data based on at least one pre-determined keyword; and determining the information further based on the further data.

[00221] According to various embodiments, the information determination method may further include determining the further data based on at least one fixed keyword.

[00222] According to various embodiments, the information determination method may further include determining the further data based on at least one dynamic keyword, wherein the at least one dynamic keyword is changed based on processing of the information determination method.

[00223] According to various embodiments, the information determination method may further include determining a user graph of users in a social relationship with at least one of the organization or each other.

[00224] According to various embodiments, the information determination method may further include determining the at least one pre-detennined user account based on the user graph.

[00225] According to various embodiments, the information determination method may further include classifying the data into data relevant to the pre-detennined organization and data irrelevant to the pre-determined organization.

[00226] According to various embodiments, the information detennination method may further include determining the information based on the data relevant to the pre- detennined organization. 100227] According to various embodiments, the information determination method may further include detecting whether the determined information related to the pre- deteimined organization is related to an evolving topic or to an emerging topic.

[00228] According to various embodiments, the information determination method may further include detecting whether the determined information related to the predetermined organization is related to an evolving topic or to an emerging topic; and determining the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.

[00229] According to various embodiments, the information determination method may further include determining the at least one pre-determined user account based on the data relevant to the pre-determined organization.

[00230] According to various embodiments, the information determination method may further include classifying the data based on learning.

[00231] According to various embodiments, the information determination method may further include classifying the data based on a support vector machine.

[00232] According to various embodiments, the information determination method may further include detecting whether the determined information related to the predetermined organization is related to the evolving topic or to the emerging topic based on learning.

[00233] According to various embodiments, the information determination method may further include detecting whether the determined information related to the predetermined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem. [00234] According to various embodiments, the optimization problem may include a temporal continuity constraint,

[00235] According to various embodiments, the optimization problem may include a sparse matching constraint.

[00236] According to various embodiments, the information determination method may further include solving the optimization problem based on a least angle regression.

[00237] According to various embodiments, the information determination method may further include removing old topics from evolving topics and emerging topics.

[00238] While the invention has been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.

Claims

Claims What is claimed is:

1. An information determination device comprising:

an account crawler configured to determine data from at least one pre-determined user account; and

. an information determiner configured to determine information related to a predetermined organization based on the data.

2. The information determination device of claim 1,

wherein the account crawler comprises a known account crawler configured to determine the data from at least one account for the organization.

3. The information determination device of claim 1 or 2,

wherein the account crawler comprises a key-user crawler configured to determine the data from at least one account of a key user.

4. The information determination device of any one of statements I to 3, further comprising:

a keyword crawler configured to determine further data based on at least one predetermined keyword; wherein the information determiner is fiirther configured to determine the information further based on the further data.

5. The information determination device of any one of claims I to 4,

wherein the keyword crawler comprises a fixed keyword crawler configured to determine the further data based on at least one fixed keyword.

6. The information deteiraination device of any one of claims 1 to 5,

wherein the keyword crawler comprises a dynamic keyword crawler configured to determine the further data based on at least one dynamic keyword, wherein at least one dynamic keyword is changed based on processing of the information determination device.

7. The information determination device of any one of claims 1 to 6, further comprising:

a user friend list crawler configured to determine a user graph of users in a social relationship with at least one of the organization or each other.

8. The information determination device of claim 7,

wherein the account crawler is further configured to determine the at least one pre-determined user account based on the user graph.

9. The infonnation determination device of any one of claims 1 to 8, further comprising:

a classifier configured to classify the data into data relevant to the pre-determined organization and data irrelevant to the pre-determined organization.

10. The infoimation determination device of claim 9,

wherein the infonnation determiner is further configured to determine the information based on the data relevant to the pre-determined organization,

11. The information determination device of any one of claims 1 to 10, further comprising:

a topic miner configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about the pre-determined organization.

12. The information determination device of claim 6, further comprising:

a topic miner configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about pre-determined organization;

wherein the dynamic keyword crawler is further configured to determine the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.

13. The information determination device of claim 9,

wherein the account crawler is further configured to determine the at least one pre-determined user account based on the data relevant to the pre-determined organization.

14. The information determination device of any one of claims 9, 10 or 13,

wherein the classifier is configured to classify the data based on learning,

15. The infonnation deteimination device of any one of claims 9, 10, 13 or 14,

wherein the classifier is configured to classify the data based on a support vector machine.

16. The information determination device of claim 1 1 ,

wherein the topic miner is configured to detect whether the determined infonnation related to the pre-determined organization is related to the evolving topic or to the emerging topic based on learning.

17. The infonnation determination device of claim 1 1 or 16,

wherein the topic miner is configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem.

18. The information determination device of claim 17, wherein the optimization problem comprises a temporal continuity constraint.

19. The information determination device of claim 17 or 18,

wherein the optimization problem comprises a sparse matching constraint.

20. The information determination device of any one of claims 17 to 19, further comprising:

an optimization problem solver configured to solve the optimization problem based on a least angle regression.

21. The information determination device of any one of claims 1 1 , 17, 18, 19, or 20 further comprising:

a trivial topic purging circuit configured to remove old topics from evolving topics and emerging topics.

22. An information determination method comprising:

determining data from at least one pre- determined user account; and

deteimining information related to a pre- determined organization based on the data.

23. The information determination method of claim 22, further comprising:

determining the data from at least one account for the organization.

24. The information determination method of claim 22 or 23, further comprising: determining the data from at least one account of a key user.

25. The infonnation determination method of any one of claims 22 to 24, further comprising:

determining further data based on at least one pre- determined keyword; and determining the information further based on the further data,

26. The information determination method of any one of claims 22 to 25, further comprising:

determining the further data based on at least one fixed keyword.

27. The information determination method of any one of claims 22 to 26, further comprising:

determining the further data based on at least one dynamic keyword, wherein the at least one dynamic keyword is changed based on processing of the information determination method.

28. The infonnation determination method of any one of claims 22 to 27, further comprising:

detennining a user graph of users in a social relationship with at least one of the organization or each other.

29. The information determination method of claim 28, further comprising:

determining the at least one pre-detennined user account based on the user graph.

30. The information determination method of any one of claims 22 to 29, further comprising:

classifying the data into data relevant to the pre-determined organization and data irrelevant to the pre-determined organization.

31. The infonnation determination method of claim 30, further comprising:

determining the information based on the data relevant to the pre-detennined organization.

32. The infonnation determination method of an one of claims 22 to 31, further comprising:

detecting whether the determined infonnation related to the pre-determined organization is related to an evolving topic or to an emerging topic.

33. The information determination method of claim 27, further comprising:

detecting whether the determined information related to the pre-detennined organization is related to an evolving topic or to an emerging topic; and detennining the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.

34. The information determination method of claim 30, further comprising:

determining the at least one pre-determined user account based on the data relevant to the pre-determined organization.

35. The information determination method of any one of claims 30, 31 or 34, further comprising:

classifying the data based on learning,

36. The information determination method of any one of claims 30, 31, 34 or 35, further comprising:

classifying the data based on a sup ort vector machine.

37. The information determination method of claim 32, further comprising:

detecting whether the determined infoiraation related to the pre-determined organization is related to the evolving topic or to the emerging topic based on learning. 8. The information determination method of claim 32 or 37, further comprising: detecting whether the determined infoimation related to the pre-determined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem,

39. The information determination method of claim 38, wherein the optimization problem comprises a temporal continuity constraint.

The information determination method of claim 38 or 39,

wherein the optimization problem comprises a sparse matching constraint.

41. The information deteiTnination method of any one of claims 38 to 40, further comprising:

solving the optimization problem based on a least angle regression.

42. The information determination method of any one of claims 32, 38, 39, 40, or 41 further comprising:

removing old topics from evolving topics and emerging topics.