2015/10/11 DBconnect: Mining Research Community on DBLP Data. Osmar R. Zaïane, Jiyang Chen, Randy Goebel. Web Mining and Social Network Analysis Workshop.

Presentation transcript:

DBconnect: Mining Research Community on DBLP Data
Osmar R. Zaïane, Jiyang Chen, Randy Goebel
Web Mining and Social Network Analysis Workshop in conjunction with ACM SIGKDD, SNA-KDD'07
Presenter: 吳建良
2015/10/11

Outline
- Community
- Motivation
  - Understand the research community and recommend collaborations
- Proposed Approach
  - Rank relevance with a random walk approach
- DBconnect
  - A navigational system to investigate community relations
- Conclusion

What is community?
- In graph theory: densely connected groups of vertices, with sparser connections between groups
- In social network analysis: groups of entities that share similar properties or connect to each other via certain relations

Why is community important?
- Interesting data with community structure: researcher collaboration, friendship networks, the WWW, massively multiplayer online gaming, electronic communications, …
- Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.

Motivation
- Understand the research network between authors, conferences and topics (rank entities by relevance for given entities)
- Find and recommend research collaborators for given authors
- Explore the academic social network

Proposed Approach
- Build a bipartite graph in the author-conference space
  - Limitation of the traditional bipartite graph model
- Extend the bipartite model to include co-authorship information
- Further extend the model to tripartite to include topic information
- Use random walk with restart on these models

An example: Author Publication Records in Conferences
- a, b, c, d, e are authors
- ac(3) means that authors a and c published three papers together in the KDD(y) conference

Bipartite model for conference-author social network
- Weight(edge) = publishing frequency of the author in a certain conference
- Limitation: fails to represent any co-authorships
- To capture the co-author relations:
  1. Add a link between a and c → misses the role of KDD
  2. Connect the link between a and c to KDD → makes the random walk infeasible
  3. Add additional nodes to represent each co-author relation → impractical, since there is a huge number of such relations
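The edge weights of the plain bipartite model, and the limitation noted above, can be illustrated with a short sketch. The records below are hypothetical (they are not the data from the example slide), and the single-letter author ids and the `bipartite_weights` helper are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical records: conference -> {co-author group: number of joint papers}.
# "ac": 3 under KDD reads as "a and c published three papers together in KDD".
records = {
    "VLDB": {"a": 2, "bd": 1},
    "KDD": {"ac": 3, "e": 1},
    "SIGMOD": {"b": 4, "de": 2},
}

def bipartite_weights(records):
    """Edge weight = number of papers an author has in a conference."""
    weights = defaultdict(int)
    for conference, groups in records.items():
        for authors, count in groups.items():
            for author in authors:  # each character is one author id
                weights[(conference, author)] += count
    return dict(weights)

weights = bipartite_weights(records)
print(weights[("KDD", "a")], weights[("KDD", "c")])  # prints "3 3"
```

Both a and c end up with weight 3 on their KDD edges, but nothing in the weights says that those three papers were written together, which is the co-authorship information the plain bipartite model loses.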

Extend the bipartite model to include co-authorship information
- Add a virtual level of nodes to replace the conference partition, and add direction to the edges
- A nodes connect to their own split relation nodes with the original weight
- C' nodes connect to all author nodes
  - If the A node and the C' node have a co-author relation → edge weight: co-author frequency * a parameter f
  - Otherwise, the edge keeps its original weight
- Set f = k (k is the total number of authors of a conference)

Further extend the model to tripartite to include topic information
- Research topic is an important component to differentiate any research community
- Authors that attend the same conferences might work on various topics

Adding topic information
- Very few conference proceedings have their table of contents included in DBLP
  - Tables of contents include session titles
- Extract relevant topics from DBLP
  - Use paper titles, and find frequent co-locations in the title text
- Method
  - Manually select a list of stopwords to remove frequently used but non-topic-related words
  - Ex: Towards, Understanding, Approach, …

Adding topic information (cont.)
- Count the frequency of every co-located pair of stemmed words
- Select the top 1000 most frequent bi-grams as topics
- Manually add several tri-grams
  - Ex: World Wide Web, Support Vector Machine, …
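The topic extraction steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword list beyond the three words named in the slides, the crude `stem` function (a stand-in for a real stemmer) and the sample titles are all hypothetical, and pairing words that become adjacent after stopword removal is a simplification that the slides do not confirm.

```python
from collections import Counter

# Assumed stopword list; the slides only name Towards, Understanding, Approach.
STOPWORDS = {"towards", "understanding", "approach",
             "a", "an", "the", "of", "for", "to", "in", "on", "with", "and"}

def stem(word):
    """Crude stand-in for a real stemmer: strip a trailing plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def top_bigrams(titles, k):
    """Return the k most frequent pairs of adjacent stemmed, non-stopword words."""
    counts = Counter()
    for title in titles:
        words = [stem(w) for w in title.lower().split() if w not in STOPWORDS]
        counts.update(zip(words, words[1:]))
    return [" ".join(pair) for pair, _ in counts.most_common(k)]

# Hypothetical paper titles, not taken from DBLP.
titles = [
    "Towards Understanding Random Walks on Graphs",
    "Random Walk Models for Social Networks",
    "A Random Walk Approach to Social Network Mining",
]
print(top_bigrams(titles, 2))  # prints ['random walk', 'social network']
```

In the slides the cut-off is the top 1000 bi-grams, with a few tri-grams added by hand afterwards.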

Random walk on DBLP social network
- Problem to be solved: given an author node a ∈ A, compute a relevance score for each author b ∈ A
- Simple example: conference-author network G
- Relational matrix M (3×5)

Random walk on DBLP social network (cont.)
- Normalize M such that every column sums to 1: Q(M) = col_norm(M), Q(M^T) = col_norm(M^T)
- Construct the adjacency matrix J of G after normalization

Random walk on DBLP social network (cont.)
Normalized adjacency matrix J of G, with the conference nodes ordered before the author nodes:
$$J = \begin{pmatrix} 0 & Q(M) \\ Q(M^T) & 0 \end{pmatrix}$$
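A minimal sketch of the normalization and of assembling J is given below. It assumes that conference nodes are indexed before author nodes (as the output step B = u(n+1:n+m) of Algorithm (1) suggests) and that J[i][j] is the probability of moving from node j to node i. The matrix M is hypothetical: only its third row is chosen as 4, 3, 2 for authors b, d, e so that the normalized column matches the 0.44, 0.33, 0.22 quoted for SIGMOD in the slides; every other entry is invented.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def col_norm(matrix):
    """Scale each column so that it sums to 1 (all-zero columns stay zero)."""
    sums = [sum(col) for col in zip(*matrix)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in matrix]

def bipartite_adjacency(M):
    """Build J = [[0, Q(M)], [Q(M^T), 0]] for n conferences and m authors."""
    n, m = len(M), len(M[0])
    QM, QMT = col_norm(M), col_norm(transpose(M))
    top = [[0.0] * n + QM[i] for i in range(n)]      # author -> conference moves
    bottom = [QMT[j] + [0.0] * m for j in range(m)]  # conference -> author moves
    return top + bottom

# Hypothetical 3x5 relational matrix (rows: conferences, columns: authors a..e).
M = [[2, 1, 0, 1, 0],
     [3, 0, 3, 0, 1],
     [0, 4, 0, 3, 2]]
J = bipartite_adjacency(M)
print(len(J), len(J[0]))  # prints "8 8"
```

Every column of J sums to 1, so multiplying a probability vector by J keeps its total mass at 1.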

Random walk on DBLP social network (cont.)
- A random walk on this graph moves from one node to one of its neighbors based on a probability
- Probability: the weight of the edge divided by the sum of the weights of all edges that connect to this node
- Ex: if we start from node SIGMOD, build u as the start vector
  - u is a one-column vector consisting of (3+7) elements
  - The element corresponding to SIGMOD is set to 1

Random walk on DBLP social network (cont.)
- u = Ju
- After step 1 of the first iteration, the random walk hits the author nodes with b = 1×0.44, d = 1×0.33, e = 1×0.22
- After step 2 of the first iteration, the chance that the random walk goes back to SIGMOD is 0.73 (the sum, over b, d and e, of the probability of reaching that author times the probability of that author stepping back to SIGMOD), and the other 0.27 goes to the other two conference nodes

Random walk on DBLP social network (cont.)
- After a few iterations, the vector converges and gives a stable score to every node
- However, these scores are always the same no matter where the walk begins
- Solved by random walk with restart
  - Given a restarting probability c
  - Use another vector v, in which the element corresponding to SIGMOD is set to 1
  - In each random walk iteration, the walker goes back to the start node with the restart probability c
  - u = (1-c)u + cv

Random walk on DBLP social network (cont.)
Algorithm (1): Random walk with restart
Input: node α ∈ A, a bipartite graph model G, restarting probability c, convergence threshold ε.
Output: relevance score vector B for author nodes.
1. Compute the adjacency matrix J, of size (n+m)×(n+m), of G. /* n conferences and m authors */
2. Initialize v_α = 0, set the element for α to 1: v_α(α) = 1.
3. While (Δu_α > ε)
     u_α = J u_α
     u_α = (1 − c) u_α + c v_α
4. Set vector B = u_α(n+1 : n+m).
5. Return B.
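Algorithm (1) can be sketched in plain Python as follows. Several details are assumptions rather than something the slides state: u is initialized to the restart vector v, the change Δu is measured as an L1 distance, the default values of c and ε are arbitrary, and a `max_iter` guard is added. The small matrix J is a hypothetical column-normalized graph with 2 conferences and 3 authors, not the DBLP data.

```python
def random_walk_with_restart(J, start, c=0.15, eps=1e-9, max_iter=1000):
    """Iterate u = (1 - c) * J u + c * v until u changes by at most eps."""
    size = len(J)
    v = [0.0] * size
    v[start] = 1.0          # restart vector: all mass on the start node
    u = v[:]                # assumption: the walk starts at the start node
    for _ in range(max_iter):
        walked = [sum(J[i][j] * u[j] for j in range(size)) for i in range(size)]
        new_u = [(1 - c) * w + c * r for w, r in zip(walked, v)]
        delta = sum(abs(a - b) for a, b in zip(new_u, u))  # L1 change
        u = new_u
        if delta <= eps:
            break
    return u

# Hypothetical column-normalized J: nodes 0-1 are conferences, 2-4 are authors.
J = [
    [0.0, 0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 0.0],
]
n = 2                                   # number of conferences
u = random_walk_with_restart(J, start=2)
B = u[n:]                               # relevance scores of the author nodes
print(B)
```

Starting from the first author (node 2), the second author shares a conference with the start node and therefore scores higher than the third author, who is reachable only through the second conference.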

Random walk on DBLP social network (cont.)
- Extend the bipartite model into a directed bipartite graph G' = (C', A, E')
  - A has m author nodes, and C has n conference nodes
  - C' is generated based on C and has n*m nodes
    - Assume every node in C is split into m nodes
- First generate a matrix M, of size (n*m)×m, for directional edges from C' to A
- Then form a matrix N, of size m×(n*m), for edges from A to C'

Random walk on DBLP social network (cont.)
- The adjacency matrix J of G'
- Algorithm (2): the random walk with restart algorithm for the directed bipartite model

Random walk on DBLP social network (cont.)
- Extend to the tripartite graph model G'' = (C, A, T, E'')
- Assume n conferences, m authors and l topics in G''
- Three corresponding matrices: U (n×m), V (m×l) and W (n×l)
- The adjacency matrices of G'' are obtained after normalization
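The normalized matrices for G'' are not preserved in this transcript, so the following is only one plausible construction, not necessarily the one used by the authors. It assumes the node order conferences, authors, topics, places U, V, W and their transposes in a symmetric block matrix, and then normalizes every column of the whole matrix. The values of U, V and W are hypothetical.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def col_norm(matrix):
    """Scale each column so that it sums to 1 (all-zero columns stay zero)."""
    sums = [sum(col) for col in zip(*matrix)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in matrix]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def tripartite_adjacency(U, V, W):
    """Assumed layout: [[0, U, W], [U^T, 0, V], [W^T, V^T, 0]], column-normalized."""
    n, m, n_topics = len(U), len(V), len(V[0])
    blocks = [
        [zeros(n, n), U, W],
        [transpose(U), zeros(m, m), V],
        [transpose(W), transpose(V), zeros(n_topics, n_topics)],
    ]
    rows = []
    for block_row in blocks:
        for i in range(len(block_row[0])):
            rows.append([x for block in block_row for x in block[i]])
    return col_norm(rows)

# Hypothetical weights: 2 conferences, 3 authors, 2 topics.
U = [[1, 1, 0], [0, 1, 1]]    # conference x author
V = [[2, 0], [1, 1], [0, 3]]  # author x topic
W = [[3, 1], [1, 4]]          # conference x topic
J = tripartite_adjacency(U, V, W)
print(len(J), len(J[0]))  # prints "7 7"
```

With this layout a walker at any node can step to either of the other two partitions, in proportion to the edge weights.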

Random walk on DBLP social network (cont.)
- Algorithm (3): the random walk with restart algorithm for the tripartite model

DBLP dataset
- Downloaded the publication data for conferences from the DBLP website in July 2007
- It contains more than 300,000 authors, about 3,000 conferences and the selected 1,000 N-gram topics
- The entire adjacency matrix becomes too big to make the random walk efficient
- Use the METIS algorithm to partition the large graph into ten subgraphs of about the same size

The DBconnect System
- ntent/dbconnect/
- A navigational system to investigate the community connections and relations
- Displays researcher statistics from academic search engines
- Provides lists of recommended entities for given authors, topics and conferences

The DBconnect System (cont.)
- Academic Information
  - Conference contribution, earliest publication year and average publications per year
  - H-index is calculated based on information retrieved from Google Scholar
  - Approximate citation numbers
- Related Conferences
  - Based on the author-conference-topic model
- Related Topics
  - Based on the author-conference-topic model
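For reference, the H-index mentioned above follows the standard definition: the largest h such that the author has h papers with at least h citations each. The sketch below computes it from a list of per-paper citation counts; the counts are hypothetical, and how DBconnect retrieves them from Google Scholar is not covered.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for one author's papers.
print(h_index([10, 8, 5, 4, 3]))  # prints 4
```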

The DBconnect System (cont.)
- Co-authors
  - Co-author name and number of papers
- Related Researchers
  - Based on the directed bipartite graph model
- Recommended Collaborators
  - Based on the author-conference-topic model
  - Co-authors' names are not shown here
  - The result implies that the given author shares similar topics and conference experiences with the listed researchers, hence the recommendation

The DBconnect System (cont.)
- Recommended To
  - The recommendation is not symmetric: author A may be recommended as a possible future collaborator to author B but not vice versa
  - Ex: Jiawei Han has been recommended as a collaborator for 6201 authors, but apparently only a few of them are recommended as collaborators to him
  - The given author has been recommended to the listed authors
- Symmetric Recommendations
  - The listed authors have also been recommended to the given author
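The two lists can be derived from per-author recommendation lists as in the sketch below. The data, the function names and the reading of "symmetric" as "recommended in both directions" are assumptions made for illustration; the slides do not show how DBconnect computes these lists.

```python
def recommended_to(recommendations, author):
    """Authors whose recommendation list contains the given author."""
    return sorted(a for a, recs in recommendations.items() if author in recs)

def symmetric_recommendations(recommendations, author):
    """Authors recommended to the given author who also get the author recommended."""
    return sorted(
        b for b in recommendations.get(author, [])
        if author in recommendations.get(b, [])
    )

# Hypothetical lists: author -> collaborators recommended to that author.
recommendations = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

print(recommended_to(recommendations, "C"))             # prints ['A', 'B']
print(symmetric_recommendations(recommendations, "A"))  # prints ['C']
```

Here C is recommended to both A and B, but only A appears in C's own list, which mirrors the asymmetry described for Jiawei Han.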

Conclusion
- Extend a bipartite graph model to incorporate co-authorship
- Propose a random walk with restart approach
  - Find related conferences, authors, and topics for a given entity
- Present the DBconnect system
  - Helps explore the relational structure and discover implicit knowledge within the DBLP data collection