Author Name Disambiguation for Citations Using Topic and Web Correlation.

Presentation on theme: "Author Name Disambiguation for Citations Using Topic and Web Correlation."— Presentation transcript:

1 Author Name Disambiguation for Citations Using Topic and Web Correlation

2 Prior work Supervised classification approaches: model each author's patterns from a set of training data. Unsupervised classification approaches: cluster the ambiguous citations into groups of distinct authors by measuring the similarity between the attributes of the citations.

3 Proposed Approach Topic Correlation Web Correlation Pair-Wise Grouping Algorithm

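4 Topic Correlation Build a topic association network 1. Use the Apriori algorithm to construct a directed graph whose edge weights are confidence values (the result is a hypergraph). 2. Use a k-way hypergraph partition algorithm to decompose the hypergraph into clusters. 3. These clusters are called the topic association network; the strength of the correlation between research topics is the distance between citations in this network.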
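The first step above can be illustrated with a minimal sketch, assuming each paper has already been reduced to a set of topic labels (the slide does not say how topics are extracted). It computes only pairwise rule confidences, confidence(a -> b) = support(a, b) / support(a), which is the edge weight the slide mentions; a full Apriori run would also grow larger itemsets, and the hypergraph partitioning step is not shown. The function name, the `min_support` parameter, and the sample data are illustrative.

```python
from itertools import permutations

def rule_confidences(transactions, min_support=2):
    """Return {(a, b): confidence} for single-topic rules a -> b.

    Only pairs that co-occur in at least `min_support` papers are kept.
    """
    item_count = {}
    pair_count = {}
    for topics in transactions:
        unique = set(topics)
        for t in unique:
            item_count[t] = item_count.get(t, 0) + 1
        # Ordered pairs, because a -> b and b -> a can differ in confidence.
        for a, b in permutations(sorted(unique), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    return {
        (a, b): n / item_count[a]
        for (a, b), n in pair_count.items()
        if n >= min_support
    }

# Hypothetical topic sets, one per paper.
papers = [
    {"clustering", "name disambiguation"},
    {"clustering", "name disambiguation", "web mining"},
    {"clustering", "web mining"},
    {"name disambiguation"},
]
edges = rule_confidences(papers)
for (a, b), conf in sorted(edges.items()):
    print(f"{a} -> {b}: {conf:.2f}")
```

The resulting dictionary is a weighted directed edge list, which is the form a partitioning step would take as input.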

5 Web Correlation Use each title to query a search engine. Filter the URLs of several digital libraries. If two citations appear at the same URL, the pair is treated as an instance of Web correlation.
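A minimal sketch of the Web correlation test follows. It assumes the search results for each title have already been collected as lists of URLs (no real search API is called), and it reads "filter" as "filter out", which the slide does not state explicitly. The host names and helper names are placeholders.

```python
from urllib.parse import urlparse

# Hypothetical hosts to drop; the slide does not list the digital libraries.
DIGITAL_LIBRARY_HOSTS = {"dl.example.org", "library.example.com"}

def filtered_urls(urls):
    """Keep only URLs whose host is not a known digital library."""
    return {u for u in urls if urlparse(u).netloc not in DIGITAL_LIBRARY_HOSTS}

def web_correlated(urls_a, urls_b):
    """True if two citations share at least one non-library URL."""
    return bool(filtered_urls(urls_a) & filtered_urls(urls_b))

# Made-up search results for three citation titles.
results = {
    "c1": ["http://dl.example.org/p/1", "http://home.example.edu/~wang/pubs.html"],
    "c2": ["http://home.example.edu/~wang/pubs.html"],
    "c3": ["http://dl.example.org/p/1"],
}
print(web_correlated(results["c1"], results["c2"]))  # shared homepage
print(web_correlated(results["c1"], results["c3"]))  # shared library URL only
```

Under the "filter out" reading, large digital libraries are excluded because they list papers by many different authors, so a shared library URL says little about authorship.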

6 Pair-Wise Grouping Algorithm Generate pairs of citations by using similarity metrics. Use the training data to train a binary classifier. Apply the classifier to determine whether the pairs are matched. Combine the predicted results to group the citations into appropriate clusters. Filter out the pairs that would make the clusters sparse.

7 Pair-Wise Similarity Metrics Similarity metrics for coauthor, title, and venue: 1. CSM 2. MSF. Similarity metric for topic correlation: TSM. Similarity metric for Web correlation: MNDF.

8 Binary Classifier A binary classifier is used to learn the distribution of pair-wise vectors. The pairs predicted as matched are used to build citation clusters (by constructing an undirected graph).
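Turning matched pairs into clusters amounts to finding the connected components of the undirected graph. The sketch below does this with a union-find structure; the classifier itself is not shown, and the citation ids and matched pairs are made up for illustration.

```python
def cluster_citations(citation_ids, matched_pairs):
    """Group citations into connected components of the matched-pair graph."""
    parent = {c: c for c in citation_ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for c in citation_ids:
        clusters.setdefault(find(c), set()).add(c)
    return sorted(sorted(members) for members in clusters.values())

ids = ["c1", "c2", "c3", "c4", "c5"]
# Pairs a hypothetical classifier predicted as "same author".
pairs = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
clusters = cluster_citations(ids, pairs)
print(clusters)
```

Each component is taken as one author. A single wrong "matched" prediction can merge two authors, which is what the cluster filter on the next slide is meant to counter.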

9 Cluster Filter A threshold determines which bridges are removed. A bridge is removed if the two components it connects each contain more vertices than the given threshold.
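The bridge rule can be sketched as follows, assuming a simple undirected graph given as an edge list. Bridges are found with a depth-first search using low-link values, and every bridge is judged against the original graph rather than after earlier removals; that ordering is an assumption, since the slide does not specify it. The function name and the sample graph are illustrative.

```python
def filter_bridges(edges, threshold):
    """Drop bridges whose two sides both have more than `threshold` vertices."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    disc, low, size = {}, {}, {}
    bridges = []  # (parent, child, dfs_root) for every bridge found
    counter = 0

    def dfs(node, parent, root):
        nonlocal counter
        disc[node] = low[node] = counter
        counter += 1
        size[node] = 1  # vertices in the DFS subtree rooted at node
        for nxt in adj[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                low[node] = min(low[node], disc[nxt])
            else:
                dfs(nxt, node, root)
                size[node] += size[nxt]
                low[node] = min(low[node], low[nxt])
                if low[nxt] > disc[node]:
                    bridges.append((node, nxt, root))

    for start in adj:
        if start not in disc:
            dfs(start, None, start)

    removed = set()
    for parent, child, root in bridges:
        below = size[child]         # side containing the child
        above = size[root] - below  # rest of the connected component
        if below > threshold and above > threshold:
            removed.add(frozenset((parent, child)))
    return [e for e in edges if frozenset(e) not in removed]

# Two triangles joined by the bridge (3, 4), plus a pendant vertex 7.
graph = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6), (6, 7)]
kept = filter_bridges(graph, threshold=2)
print(kept)
```

In the sample graph the edge (6, 7) is also a bridge, but one side has a single vertex, so it is kept. The recursive search is fine for small citation graphs; very large ones would need an iterative version.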

10 Detecting Ambiguous Author Names in Crowdsourced Scholarly Data

11 Prior Work Name disambiguation has been cast into the problem of clustering a set of publications into profiles such that each profile corresponds to a single author.

12 Name Variations and Citations Extract the name variations from a collection of publications. Sort them by number of citations. Look at the percentage of the total citations that are attributed to the top name variations. (A high percentage suggests that the name is not ambiguous.)
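The citation-share heuristic reduces to a few lines of arithmetic. The sketch below assumes the citation counts per name variation are already available as a dictionary; the `top_k` parameter and the sample counts are illustrative, and the slide gives no cutoff for what counts as a "high" percentage.

```python
def top_variation_share(citations_by_variation, top_k=1):
    """Fraction of all citations attributed to the top_k name variations."""
    counts = sorted(citations_by_variation.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:top_k]) / total

# Made-up citation counts for the variations of one author name.
variations = {"J. Smith": 80, "John Smith": 15, "Jon Smith": 5}
share_top1 = top_variation_share(variations, top_k=1)
share_top2 = top_variation_share(variations, top_k=2)
print(f"top 1: {share_top1:.0%}, top 2: {share_top2:.0%}")
```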

13 Topic Consistency Leverage the discipline tags crowdsourced from the users of the Scholarometer system. Detect different but related disciplines associated with an author name: map an author's publications to topics, and measure the similarity between these topics. Derive an author's topic profile.

14 A brief survey of automatic methods for author name disambiguation

15 Two problems Synonyms: the same author may appear under distinct names. Polysems: distinct authors may have similar names.

16 Proposed taxonomy

17 Author Grouping Methods Defining a similarity function: 1. Using predefined functions: the Levenshtein distance, Jaccard coefficient, cosine similarity, soft-TFIDF and others. 2. Learning a similarity function: use the training data to produce a similarity function S from R × R to {0, 1}, where R is the set of references, 1 means that the two references refer to the same author, and 0 means that they do not. 3. Exploiting graph-based similarity functions: create a coauthorship graph G = (V, E) for each ambiguous group. Each distinct coauthor name is represented by a vertex, and the weight of an edge is related to the number of articles coauthored by the author names that its two vertices represent.
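Two of the predefined functions and the coauthorship edge weights can be sketched as follows. This is a minimal illustration with made-up names; it takes the edge weight to be the raw count of coauthored articles, whereas the slide only says the weight is "related to" that count.

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def coauthor_edge_weights(articles):
    """Count, for each pair of author names, the articles they share."""
    weights = Counter()
    for authors in articles:
        for u, v in combinations(sorted(set(authors)), 2):
            weights[(u, v)] += 1
    return weights

name_distance = levenshtein("J. Smith", "J. Smyth")
coauthor_overlap = jaccard({"B. Li", "C. Wong"}, {"B. Li", "D. Kim"})
weights = coauthor_edge_weights([
    ["A. Gupta", "B. Li"],
    ["A. Gupta", "B. Li", "C. Wong"],
])
print(name_distance, coauthor_overlap, dict(weights))
```

A learned similarity function (item 2) would typically take values like these as features for a pair of references and output 0 or 1.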

18 Author Grouping Methods Clustering techniques: 1. Partitioning 2. Hierarchical agglomerative clustering 3. Density-based clustering 4. Spectral clustering

19 Author assignment methods Classification: assign the references to their authors using a supervised machine learning technique. Clustering: use probabilistic techniques to determine the author in an iterative way to fit the model.

20 Explored evidence Citation information: the attributes directly extracted from the citations, such as author/coauthor names, work title, publication venue title, year, and so on. Web information: Data retrieved from the web that is used as additional information about an author publication profile. Implicit evidence: Evidence inferred from visible elements of attributes, such as the latent topics of a citation.

21 Summary of characteristics-Author grouping methods

22 Summary of characteristics-Author assignment methods

23 Open challenges Very little data in the citations. Very ambiguous cases: ambiguous references will have coauthors who also have ambiguous names (especially Asian names). Citations with errors. Efficiency. Different knowledge areas: our focus is only on computer science. Incremental disambiguation. Author profile changes. New authors.

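24 pandasearch name-ambiguity research plan Read the related papers and identify the solution best suited to the current problem. Focus on implicit evidence and web information (especially scholars' personal homepages and CVs). Work on both efficiency and accuracy, with the emphasis on accuracy. Study the fundamentals of data mining and machine learning.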

25 pandasearch name-ambiguity implementation plan Type of approach: author grouping methods, learning a similarity function. Explored evidence: citation information, web information, implicit evidence.
