WO2019109255A1 - Method for inferring scholars' temporal location in academic social network

Info

Publication number: WO2019109255A1
Application number: PCT/CN2017/114646
Authority: WO
Inventors: Jie Tang; Kan Wu; Bo Gao; Debing Liu
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2019-06-13
Anticipated expiration: 2020-06-05

Abstract

Embodiments of the present disclosure provide a Space-Time Factor Graph Model (STFGM) that incorporates time and space correlations to infer authors' missing high-resolution affiliations over time in an academic social network. In addition, at the personal global level, different smoothing methods are devised to bridge the "holes" between years and trim the "glitches", according to one of two priority goals: increasing the number of information items with the least precision loss, or increasing precision with the fewest information items trimmed. STFGM outperforms the baselines by 6%-27% on two datasets (Aminer and MAG) and on two precision metrics. Further, the devised smoothing methods can gain 5%-18% more items with only a minor precision loss of about 0.05%-1%, or achieve a 2%-7% precision increase with a 3%-7% item loss, depending on the priority setting. Finally, two applications are developed based on the inference model and smoothing methods, which further demonstrate their effectiveness.

Description

METHOD FOR INFERRING SCHOLARS’ TEMPORAL LOCATION IN ACADEMIC SOCIAL NETWORK

FIELD

The present disclosure relates to the field of social network technology, and more particularly, to a method for inferring scholars’ temporal location in an academic social network.

BACKGROUND

Tough competition in personalized information services across a variety of domains has driven the demand for more precise user profiling. As the saying goes, “You cannot judge of a man till you know his whole story”: exploring the affiliations where a person has studied or worked at different times can help profile him/her better. One common way is to extract a person’s working or studying experiences from the curriculum vitae on his/her personal home page. However, automatically extracting formatted affiliations with time from an unstructured biography paragraph is still a big challenge.

Many studies concern the mobility of scientists, and they need a large amount of affiliation information about researchers. Traditional methods, such as biographical questionnaires or individual interviews, can collect only limited data. It is therefore necessary to develop a powerful approach to automatically infer one’s affiliation.

Fortunately, the development of academic social networks such as Aminer and MAG offers a new opening. An author’s coauthors and affiliations at different times can be obtained from his/her included papers. However, there are some obstacles. Our sample statistics of about 1.5 million authors in the Aminer academic network show that 0.55 million authors (about 1/3 of the sample) do not have any affiliations in their papers. Moreover, in a modern academic network with more than one hundred million authors, it is often hard to ascribe papers to the right authors, since there are different authors with the same name, the same author with different name spellings and abbreviations, even different papers with the same name, etc. The missing information therefore has to be inferred from sparse and noisy data.

Moreover, traditional location inference methods rarely consider time, yet people do not tend to stay in one place all their lives, so inferring temporal location is more challenging. In addition, many location inference methods work only at a very coarse granularity with very limited classification labels, e.g., inferring a country or a state/city within a country. However, the number of higher-resolution institutions, such as individual universities and research institutions, is huge.

Most existing methods infer users’ affiliations independently, using an indirect metric on neighbors such as the most frequent value or the geometric median. The present disclosure illustrates how to infer all the affiliations simultaneously by introducing space correlations between authors.

Another challenge that cannot be neglected arises after all the missing affiliations have been inferred from authors’ papers: an author may not publish papers every year, so the question is how to fill in the missing information for the gap years that have no papers without hurting overall precision.

SUMMARY

Embodiments of the present disclosure aim to solve at least one of the technical problems in the related art.

In order to achieve the above object, embodiments of the present disclosure provide a method for inferring scholars’ temporal location in an academic social network.

To the best of our knowledge, this is the first attempt to infer people’s affiliations at different times in history. The present disclosure proposes a Space-Time Factor Graph Model (STFGM) which outperforms the baselines by 6%-27% on two datasets and on two precision metrics. At the personal global level, the devised smoothing methods can gain 5%-18% more items with a minor precision loss of about 0.05%-1%, or achieve a 2%-7% precision increase with a 3%-7% item loss, depending on the priority setting. Based on the proposed model and smoothing method, the present disclosure develops an application which can automatically list a given author’s career experiences and draw the trajectory on a map. The service will soon be open to the public. Based on the many individual trajectories inferred, the present disclosure develops another application to show and study scientists’ group migrations over a century, which yields some interesting findings.

Before proceeding, we first introduce two baseline solutions for this problem. The first is based on the idea that a person is much more likely to work with someone in the same affiliation than with someone in a different affiliation. Our survey of 2 million randomly selected papers confirmed this assumption: about 71.3% of the papers have two or more coauthors from the same affiliation. To predict a missing affiliation of an author at time t, we can count the affiliations that her/his coauthors belong to and simply assign the affiliation with the maximum count to her/him. We call this method the Statistics-Based Model.
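The counting rule of the Statistics-Based Model can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the disclosure: the function name `predict_affiliation` and the input format (a plain list of the coauthors' affiliations at time t, with `None` for unknown ones) are assumptions made for the example.

```python
from collections import Counter

def predict_affiliation(coauthor_affiliations):
    """Return the most frequent known affiliation among coauthors, or None.

    coauthor_affiliations: affiliations of the target author's coauthors
    at time t; unknown affiliations are given as None (assumed format).
    """
    counts = Counter(a for a in coauthor_affiliations if a is not None)
    if not counts:
        return None  # no coauthor has a known affiliation at time t
    return counts.most_common(1)[0][0]

guess = predict_affiliation(
    ["Tsinghua University", "MIT", "Tsinghua University", None]
)
```

When no coauthor has a known affiliation, the sketch returns `None` rather than guessing, which matches the observation that the method needs at least one labeled neighbor.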

The second baseline recasts the task as a binary classification problem, defined as follows.

At time t, each author a is associated with individual features vec(a)^t, while each coauthor relation (a_i1, a_i2) is associated with coauthor features vec(a_i1, a_i2)^t. For each coauthor relation (a_i1, a_i2), a binary label is used to indicate whether a_i1 and a_i2 belong to the same affiliation. Then, given training data, a two-class rankSVM model is built based on Maximum Likelihood Estimation (MLE), in which the model input is a combination of individual and coauthor features. This approach is called the Pair-wise Comparison Model.

Implementing the Statistics-Based Model is easy, but the model is coarse and cannot capture the properties and behavior features of the target author and his/her coauthors at the microscopic level. The Pair-wise Comparison Model focuses on the features of the different authors at the microscopic level, and thus can fit the training datasets better than the former. However, it assumes that the instances to be predicted are independent of each other, which is not true. In fact, there are many correlations among the instances that can help increase performance.

Embodiments of the present disclosure propose the Space-Time Factor Graph. Let G = (V, E) denote an undirected graph, where V and E are the sets of nodes and edges, respectively. Each node is a tuple (t, a_i1, a_i2) indicating that authors a_i1 and a_i2 co-wrote a paper at time t. In terms of edges, two types of correlations are considered, namely space correlations and time correlations. Additionally, the individual and coauthor features mentioned before are used. The objective is to maximize the probability P(Y|G, X).

Three kinds of heuristic knowledge are incorporated in our Space-Time Factor Graph Model for predicting the affiliation likeness between authors.

Heuristic knowledge one: the first is very straightforward. The respective demographic features of two authors, and the things they did together, must have a direct connection with the likeness between them. For example, if two persons are both undergraduate students, are of similar age, have coauthored many papers, and write papers on closely related topics, it can be inferred with much confidence that they are in the same school or university.

Heuristic knowledge two: if author B and author C are known to have the same affiliation or to have many common connections, and the likeness between author A and author B is high, then the probability that A and C have the same affiliation is also high. We refer to the correlation between different authors at the same time as space correlation.

Heuristic knowledge three: if author A and author B were in the same affiliation last year, then A and B may continue to be in the same affiliation this year. Moreover, if A and B are also known to be in the same affiliation next year, then the probability that they share an affiliation this year increases as well. We refer to the correlation between different times for the same author pair as time correlation.
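The node tuples and the two kinds of correlation edges can be illustrated with a small Python sketch. The source does not spell out the exact neighbor rules, so two assumptions are made here and marked in the comments: two nodes of the same year are space-correlated when they share an author, and two nodes with the same author pair are time-correlated when their years are adjacent. The function name `build_graph` and the input format are also hypothetical.

```python
from itertools import combinations

def build_graph(papers):
    """Build nodes (t, a1, a2) plus space and time edges.

    papers: iterable of (year, list_of_authors) pairs (assumed format).
    """
    nodes = set()
    for year, authors in papers:
        for a1, a2 in combinations(sorted(set(authors)), 2):
            nodes.add((year, a1, a2))
    nodes = sorted(nodes)

    space_edges, time_edges = [], []
    for u, v in combinations(nodes, 2):
        same_year = u[0] == v[0]
        shares_author = bool(set(u[1:]) & set(v[1:]))
        if same_year and shares_author:
            # Assumption: space correlation = same year, one common author.
            space_edges.append((u, v))
        elif u[1:] == v[1:] and abs(u[0] - v[0]) == 1:
            # Assumption: time correlation = same pair, adjacent years.
            time_edges.append((u, v))
    return nodes, space_edges, time_edges

nodes, space_edges, time_edges = build_graph(
    [(2010, ["A", "B", "C"]), (2011, ["A", "B"])]
)
```

In this toy input, the 2010 paper yields three author pairs that are mutually space-correlated, and the pair (A, B) is linked across 2010 and 2011 by one time edge.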

A detailed description of the proposed STFGM is given below. In STFGM, each tuple (Year t, Author a_i1, Author a_i2) corresponds to an observation instance. We define the same number of hidden binary-valued variables, one associated with each observation instance, representing the relation between the two authors at time t. More concretely, if Author a_i1 has the same affiliation as Author a_i2 at time t, then the value of the hidden variable related to the tuple (t, a_i1, a_i2) is 1. Otherwise, it is 0. Based on the previous heuristic knowledge, we define three factors.

Attribute factor function: It captures the features of each tuple (t, a_i1, a_i2), including the respective features of the two authors and the concurrent features between them. The function characterizes how the observed tuple features contribute to the likeness of the authors in the tuple. It is defined as an exponential-linear function of a classification label, a weighting vector, and a vector of feature functions Φ. The classification label indicates whether the two authors in the author pair <a_i1, a_i2> are in the same affiliation at time t. The feature vector of the observation tuple (t, a_i1, a_i2) is the concatenation of the vectors of a_i1’s and a_i2’s respective features and their shared common features at time t.
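One plausible written form of the attribute factor, consistent with the definitions above, is sketched here. The original notation is not reproduced in this text, so the symbols y_i (label of the i-th tuple), x_i (its feature vector), and α (the weighting vector) are assumed names, and any local normalization constant is omitted.

```latex
f(y_i, \mathbf{x}_i) = \exp\left\{ \boldsymbol{\alpha}^{\top} \Phi(y_i, \mathbf{x}_i) \right\}
```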

Space factor function: The construction of the space factor function is based on the second heuristic knowledge mentioned above and captures the space correlation between the hidden variables at the same time. It is also defined as an exponential-linear function, in which N_S denotes the neighbors that have space correlations with a given hidden variable, and C is the number of types of space correlations.
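A hedged sketch of what such a space factor could look like follows. Only N_S and C come from the text. The per-type weights β_c, the pairwise feature functions g_c (for example, an indicator that two labels agree), and the per-type neighbor sets N_S^c are assumptions introduced for illustration.

```latex
g\bigl(y_i, N_S(y_i)\bigr) = \exp\Bigl\{ \sum_{c=1}^{C} \sum_{y_j \in N_S^{c}(y_i)} \beta_c \, g_c(y_i, y_j) \Bigr\}
```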

Time factor function: The construction is based on the third heuristic knowledge and captures the time correlation between the hidden variables. It is also defined as an exponential-linear function, in which N_T denotes the neighbors that have time correlations with a given hidden variable, and C' is the number of types of time correlations.
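By analogy with the space factor, a time factor of exponential-linear form might be written as below. Only N_T and C' are taken from the text. The weights γ_c, the pairwise feature functions h_c, and the per-type neighbor sets N_T^c are assumed names.

```latex
h\bigl(y_i, N_T(y_i)\bigr) = \exp\Bigl\{ \sum_{c=1}^{C'} \sum_{y_j \in N_T^{c}(y_i)} \gamma_c \, h_c(y_i, y_j) \Bigr\}
```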

Model Learning: Once the authors’ attributes and the space and time correlations have been modeled, the next goal is to combine all the factors, observation instances, and hidden variables into a unified model. N_S and N_T are reused, without confusion, to denote the sets of all space relations and all time relations. Define X and Y as two sets representing, respectively, all the observation instances (including instance attributes and relations) and all the hidden variables in the model. Directly modeling the joint probability P(X, Y) is very difficult, because it requires modeling a distribution over all the possible values of X. Fortunately, we can compute the joint conditional probability P(Y|X), which avoids computing P(X).

In the resulting formula, G is an aggregation of the factor functions over all the hidden variables, the model has a parameter configuration, and a global normalization term keeps the joint conditional probability value between 0 and 1. Because Y is partially labeled, we define our log-likelihood objective function on the labeled data Y^L. We use Y'|Y^L to denote a label configuration Y' that satisfies all the known labels Y^L.
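The joint conditional probability and the partially labeled objective can be sketched as follows. This is an assumed reconstruction in the usual factor-graph form, not the formula of the disclosure: Ψ_θ stands for the product of the attribute, space, and time factor functions over all hidden variables, θ for the parameter configuration, Z for the global normalization term, and O(θ) for the log-likelihood objective. All of these symbol names are hypothetical.

```latex
P_\theta(Y \mid X) = \frac{\Psi_\theta(Y, X)}{Z},
\qquad
Z = \sum_{Y'} \Psi_\theta(Y', X)

\mathcal{O}(\theta) = \log P_\theta(Y^L \mid X)
  = \log \sum_{Y' \mid Y^L} \Psi_\theta(Y', X) \; - \; \log Z
```

The first sum in the objective runs only over label configurations that agree with the known labels, while Z runs over all configurations, which is why the objective can be evaluated even though Y is only partially labeled.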

Additional aspects and advantages of embodiments of the present disclosure will be given in part in the following descriptions, become apparent in part from the following descriptions, or be learned from the practice of the embodiments of the present disclosure.

DETAILED DESCRIPTION

Reference will be made in detail to embodiments of the present disclosure. The embodiments described herein are explanatory and illustrative, and are not to be construed to limit the present disclosure.

Before proceeding, two baseline solutions for this problem are introduced first. The first is based on the idea that a person is much more likely to work with someone in the same affiliation than with someone in a different affiliation. Our survey of 2 million randomly selected papers confirmed this assumption: about 71.3% of the papers have two or more coauthors from the same affiliation. To predict a missing affiliation of an author at time t, we can count the affiliations that her/his coauthors belong to and simply assign the affiliation with the maximum count to her/him. We call this method the Statistics-Based Model.

Merely counting the frequency of affiliations among one’s coauthors ignores some helpful information. Intuitively, the problem can be formulated as a multi-label classification in which the label classes are the set of available affiliations. However, in our basic statistics of about 2 million papers, there are more than 10 thousand different affiliations, and a 1 vs. 10k+ classification model is likely to fail with limited features. To relieve this problem, we first restrict the candidate affiliations to those of one’s coauthors. Furthermore, we change the multi-label classification problem into a binary classification problem: instead of predicting the affiliation label directly, the model is designed to predict whether the target author belongs to the same affiliation as a given coauthor.
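The two reduction steps described above, restricting the candidates to the coauthors' affiliations and turning each (target, coauthor) pair into a binary instance, can be sketched as follows. The function names, the `known` lookup table keyed by (author, year), and the tuple layout of an instance are assumptions made for this illustration only.

```python
def candidate_affiliations(coauthors, year, known):
    """Candidate set: only the affiliations of the target's coauthors."""
    return {known[(c, year)] for c in coauthors if (c, year) in known}

def pair_instances(target, coauthors, year, known):
    """One binary instance per coauthor whose affiliation is known.

    Label is 1 if target and coauthor share an affiliation, 0 if not,
    and None when the target's own affiliation is missing (to be predicted).
    """
    target_aff = known.get((target, year))
    instances = []
    for c in coauthors:
        co_aff = known.get((c, year))
        if co_aff is None:
            continue  # a coauthor without affiliation gives no candidate
        label = None if target_aff is None else int(co_aff == target_aff)
        instances.append((year, target, c, label))
    return instances

known = {
    ("alice", 2010): "Tsinghua University",
    ("bob", 2010): "Tsinghua University",
    ("carol", 2010): "MIT",
}
coauthors = ["bob", "carol", "dave"]
candidates = candidate_affiliations(coauthors, 2010, known)
instances = pair_instances("alice", coauthors, 2010, known)
```

Instances with a label of 1 or 0 can serve as training data, while instances with `None` are the ones a classifier would be asked to predict.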

We define the binary classification problem as follows.

At time t, each author a is associated with individual features vec(a)^t, while each coauthor relation (a_i1, a_i2) is associated with coauthor features vec(a_i1, a_i2)^t. For each coauthor relation (a_i1, a_i2), a binary label is used to indicate whether a_i1 and a_i2 belong to the same affiliation. The model input is the combination of individual and coauthor features. This approach is called the Pair-wise Comparison Model.

After an author’s affiliations in different years have been inferred, there may still be “holes” and/or “glitches” in discrete years. For example, when no papers are collected for an author in some years, or the author publishes none, the previous algorithms cannot infer the affiliations without observation instances, so “holes” appear. As another example, a coauthor predicted to share the author’s affiliation may have two or more affiliations in a year, some of which do not belong to the query author; in that year it is not easy to distinguish which one is wrong, so “glitches” arise. Smoothing includes stretching the data to bridge the “holes” and trimming out the “glitches”. These are two inverse processes: stretching the data may introduce wrong information and reduce precision, while trimming the data increases precision but may lose useful information.
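The text does not specify the stretching rule, so the following Python sketch shows only one simple possibility, under a stated assumption: a hole is filled when the nearest observed years on both sides carry the same affiliation and the gap is no longer than a hypothetical `max_gap` parameter. The function name `bridge_holes` and the year-to-affiliation dictionary format are likewise illustrative.

```python
def bridge_holes(timeline, max_gap=3):
    """Fill missing years bounded on both sides by the same affiliation.

    timeline: dict mapping year -> inferred affiliation (assumed format).
    max_gap: largest number of consecutive missing years to fill
             (hypothetical tuning parameter).
    """
    filled = dict(timeline)
    years = sorted(timeline)
    for left, right in zip(years, years[1:]):
        gap = right - left - 1
        if 0 < gap <= max_gap and timeline[left] == timeline[right]:
            for year in range(left + 1, right):
                filled[year] = timeline[left]
    return filled

stretched = bridge_holes(
    {2005: "MIT", 2008: "MIT", 2009: "Stanford", 2015: "Stanford"}
)
```

A larger `max_gap` adds more items at a higher risk of introducing wrong years, which mirrors the trade-off between item growth and precision loss described above.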

The trim algorithm we use is Local Outlier Factor (LOF), which identifies density-based local anomalies. We use the Google Maps API to convert the affiliations into latitude-longitude pairs. Each point fed into LOF is a tuple of latitude, longitude, and time t, so that the information is smoothed in both space and time.
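To make the trimming step concrete, here is a small plain-Python sketch of LOF applied to (latitude, longitude, year) points. It is a simplified illustration rather than the implementation used in the disclosure: it takes exactly k nearest neighbors (ignoring distance ties), does not rescale years against degrees, hard-codes the coordinates instead of calling a geocoding service, and uses a hypothetical cut-off of 1.5 for declaring a glitch.

```python
import math

def lof_scores(points, k=3):
    """Local Outlier Factor per point; values well above 1 are outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neighbours.append(order[:k])           # k nearest, self excluded
        k_distance.append(dist[i][order[k - 1]])
    lrd = []                                   # local reachability density
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))    # assumes no duplicate points
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i])
            for i in range(n)]

# Assumed example: five consecutive years in one city plus one distant point.
timeline = [
    (39.9, 116.4, 2005),
    (39.9, 116.4, 2006),
    (39.9, 116.4, 2007),
    (39.9, 116.4, 2008),
    (39.9, 116.4, 2009),
    (37.4, -122.2, 2007),  # candidate glitch, far from the others
]
scores = lof_scores(timeline, k=3)
GLITCH_THRESHOLD = 1.5  # hypothetical cut-off, not from the source
kept = [p for p, s in zip(timeline, scores) if s <= GLITCH_THRESHOLD]
```

Raising the threshold trims fewer points and keeps more items, while lowering it trims more aggressively, so the threshold is one place where the precision versus item-count priority could be set.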

Although explanatory embodiments have been shown and described, it would be appreciated by those skilled in the art that the above embodiments cannot be construed to limit the present disclosure, and changes, alternatives, and modifications can be made in the embodiments without departing from the scope of the present disclosure.

Claims

A method for inferring scholars’ temporal location in an academic social network, comprising: building a two-class rankSVM model based on Maximum Likelihood Estimation (MLE),

wherein the model input is a combination of individual and coauthor features, and this approach is called the Pair-wise Comparison Model;

capturing an attribute factor function of each tuple (t, a_i1, a_i2) by an exponential-linear function,

wherein the function characterizes how the observed tuple features contribute to the likeness of the authors in the tuple;
a classification label denotes whether the two authors in the author pair <a_i1, a_i2> are in the same affiliation at time t;
the function is parameterized by a weighting vector; Φ is the vector of feature functions;
the feature vector of the observation tuple (t, a_i1, a_i2) is the concatenation of the vectors of a_i1’s and a_i2’s respective features and their shared common features at time t;

capturing a space factor function by an exponential-linear function,

wherein the construction of the space factor function is based on the second heuristic knowledge and captures the space correlation between the hidden variables at the same time;
N_S denotes the neighbors that have space correlations with a given hidden variable;
C is the number of types of space correlations;

capturing a time factor function by an exponential-linear function,

wherein the construction of the time factor function is based on the third heuristic knowledge and captures the time correlation between the hidden variables;
N_T denotes the neighbors that have time correlations with a given hidden variable;
C' is the number of types of time correlations;

obtaining a unified model based on features of authors’ attributes, space and time correlations, observation instances, and hidden variables, and computing a joint conditional probability P(Y|X) by a formula,

wherein,
G is an aggregation of the factor functions over all the hidden variables;
the model has a parameter configuration;
a global normalization term keeps the joint conditional probability value between 0 and 1; because Y is partially labeled, the log-likelihood objective function is defined on the labeled data Y^L; and Y'|Y^L denotes a label configuration Y' that satisfies all the known labels Y^L.