On Node Classification in Dynamic Content-based Networks



Motivation

[Diagrams, three snapshots] A co-authorship network among data-mining researchers (Ke Wang, Jiawei Han, Jian Pei, Kenneth A. Ross, Marianne Winslett, Xifeng Yan, Philip S. Yu, Charu Aggarwal), where each author is annotated with topical keywords such as "Data Mining", "Clustering", "Sequential Pattern", and "Association Rules". Snapshots for the years 2001, 2002, and 2003 show that both the link structure and the text content change over time.

Motivation

- Networks annotated with an increasing amount of text: citation networks, co-authorship networks, product databases with large amounts of text content, etc. These networks are highly dynamic.
- Node classification problem: it often arises in network scenarios in which the underlying nodes are associated with content. A subset of the nodes in the network may be labeled. Can we use these labeled nodes, in conjunction with the structure, to classify the nodes that are not currently labeled?
- Applications

Challenges

- Information networks are very large: the method must be scalable and efficient.
- Many such networks are dynamic: the method must be updatable in real time, self-adaptable, and robust.
- Such networks are often noisy: the method must be intelligent and selective.
- Heterogeneous correlations exist in such networks.

Outline

- Related Works
- DYCOS: DYnamic Classification algorithm with cOntent and Structure
  - Semi-bipartite content-structure transformation
  - Classification using a series of text- and link-based random walks
  - Accuracy analysis
- Experiments (NetKit-SRL)
- Conclusion

Related Works

- Link-based classification (Bhagat et al., WebKDD 2007): local iterative and global nearest-neighbor approaches.
- Content-only classification (Nigam et al., Machine Learning 2000): uses each object's own attributes only.
- Relational classification (Sen et al., Technical Report 2004): uses each object's own attributes, plus the attributes and known labels of its neighbors.
- Collective classification (Macskassy & Provost, JMLR 2007; Sen et al., Technical Report 2004; Chakrabarti, SIGMOD 1998):
  - Local classification, which is flexible: ranging from a decision tree to an SVM.
  - Approximate inference algorithms: iterative classification, Gibbs sampling, loopy belief propagation, relaxation labeling.

Outline

- Related Works
- DYCOS: DYnamic Classification algorithm with cOntent and Structure
  - Semi-bipartite content-structure transformation
  - Classification using a series of text- and link-based random walks
  - Accuracy analysis
- Experiments (NetKit-SRL)
- Conclusion

DYCOS in a Nutshell

- Node classification in a dynamic environment.
- Dynamic network: the entire network at time t is denoted by G_t = (N_t, A_t, T_t), where N_t is the node set, A_t the edge set, and T_t the set of labeled nodes.
- Problem statement: classify the unlabeled nodes (N_t \ T_t) using both the content and the structure of the network, for all time stamps, in an efficient and accurate manner.
- Both the structure and the content of the network change over time (t, t+1, t+2, ...).
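One way to picture the bookkeeping this problem statement implies is a small adjacency structure that supports incremental updates. The sketch below is hypothetical (the class and method names are mine, not the talk's); it only illustrates the G_t = (N_t, A_t, T_t) decomposition.

```python
from collections import defaultdict

class DynamicNetwork:
    """Illustrative snapshot G_t = (N_t, A_t, T_t): node set, edge set,
    and labeled subset. Not the paper's actual data structure."""

    def __init__(self):
        self.neighbors = defaultdict(set)  # N_t (keys) and A_t (adjacency sets)
        self.labels = {}                   # T_t: node -> class label

    def add_edge(self, u, v):
        # An incremental structural update arriving at time t+1
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)

    def set_label(self, node, label):
        self.labels[node] = label

    def unlabeled(self):
        # N_t \ T_t: the nodes that must be classified
        return [n for n in self.neighbors if n not in self.labels]

g = DynamicNetwork()
g.add_edge("a", "b")
g.add_edge("b", "c")
g.set_label("a", "Databases")
print(g.unlabeled())  # ['b', 'c']
```

Because edges and labels arrive as individual updates, a structure like this can absorb changes in O(1) per edge, which is the behavior the "efficient updates" requirement asks for.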

Semi-Bipartite Transformation

- Text-augmented representation: leveraged for a random-walk-based classification model that uses both links and text.
- Two partitions: structural nodes and word nodes.
- Semi-bipartite: one partition of nodes is allowed to have edges either within the set, or to nodes in the other partition.
- Efficient updates upon dynamic changes.
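A minimal sketch of this transformation, assuming word nodes are tagged tuples and structural nodes are plain identifiers (both representation choices are mine, not the talk's):

```python
from collections import defaultdict

def build_semi_bipartite(struct_edges, node_words):
    """Structural nodes keep their internal edges; word nodes connect
    only across the partition, never to each other (semi-bipartite)."""
    adj = defaultdict(set)
    for u, v in struct_edges:          # edges within the structural partition
        adj[u].add(v)
        adj[v].add(u)
    for node, words in node_words.items():
        for w in words:
            wn = ("word", w)           # a word node; no word-word edges exist
            adj[node].add(wn)
            adj[wn].add(node)
    return adj

adj = build_semi_bipartite([("p1", "p2")],
                           {"p1": {"mining"}, "p2": {"mining"}})
# p1 and p2 are now connected both directly and via the word node "mining",
# so a walk can exploit either the link structure or the shared text.
```

The asymmetry is the point: structural-structural edges carry link information, node-word edges carry content information, and the absence of word-word edges keeps the word partition a pure intermediary.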

Random-Walk-Based Classification

- Random walks over the augmented structure:
  - Starting node: the unlabeled node to be classified.
  - Structural hop: a random jump from a structural node to one of its neighbors.
  - Content-based multi-hop: a jump from a structural node to another through implicit common word nodes.
  - Structural parameter: p_s.
- Classification: classify the starting node with the most frequently encountered class label during the random walks.
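The walk described above can be sketched as follows. This is a hedged illustration, not DYCOS itself: the function name, defaults, and the convention that word nodes are tuples are all my assumptions, and I treat p_s as the probability of taking a structural hop.

```python
import random
from collections import Counter, defaultdict

def classify(start, adj, labels, p_s=0.7, walks=20, hops=5, seed=0):
    """With probability p_s take a structural hop; otherwise take a
    content-based multi-hop (node -> shared word -> node). The majority
    class label among labeled nodes visited wins."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(walks):
        node = start
        for _ in range(hops):
            struct_nbrs = [n for n in adj[node] if not isinstance(n, tuple)]
            word_nbrs = [n for n in adj[node] if isinstance(n, tuple)]
            if struct_nbrs and (not word_nbrs or rng.random() < p_s):
                node = rng.choice(struct_nbrs)            # structural hop
            elif word_nbrs:
                word = rng.choice(word_nbrs)              # content multi-hop:
                nxt = [n for n in adj[word] if n != node]  # node -> word -> node
                node = rng.choice(nxt) if nxt else node
            if node in labels:
                votes[labels[node]] += 1                  # tally visited labels
    return votes.most_common(1)[0][0] if votes else None

# Toy semi-bipartite graph: papers p1 and p2 are linked directly and via
# the shared word node ("word", "mining"); only p2 is labeled.
adj = defaultdict(set)
adj["p1"] = {"p2", ("word", "mining")}
adj["p2"] = {"p1", ("word", "mining")}
adj[("word", "mining")] = {"p1", "p2"}
print(classify("p1", adj, {"p2": "DB"}))  # prints DB
```

Note that the content-based multi-hop never stops on a word node: it passes through in a single step, which is why the word partition can stay edge-free internally.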

Gini-Index & Inverted Lists

- Discriminative keywords: a set M_t of the top m words with the highest discriminative power is used to construct the word-node partition.
- Gini-index: the value of G(w) lies in the range (0, 1); words with a higher gini-index are more discriminative for classification purposes.
- Inverted lists: an inverted list of keywords for each node, and an inverted list of nodes for each keyword.
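The talk does not spell out the exact formula, so the sketch below uses one common form of the gini-index, G(w) = Σ_i p_i(w)², where p_i(w) is the fraction of w's occurrences that fall in class i; a word concentrated in one class pushes G(w) toward 1.

```python
from collections import Counter, defaultdict

def top_discriminative(docs, m):
    """docs: list of (label, set_of_words). Returns the m words with the
    highest gini-index G(w) = sum_i p_i(w)^2 (assumed formula)."""
    per_word = defaultdict(Counter)          # word -> class occurrence counts
    for label, words in docs:
        for w in words:
            per_word[w][label] += 1

    def gini(w):
        counts = per_word[w]
        total = sum(counts.values())
        return sum((c / total) ** 2 for c in counts.values())

    return sorted(per_word, key=gini, reverse=True)[:m]

docs = [("DB", {"index", "mining"}),
        ("DB", {"index"}),
        ("ML", {"mining"})]
print(top_discriminative(docs, 1))  # ['index']: all its occurrences are in DB
```

Here "index" scores G = 1 (all occurrences in one class) while "mining" scores 0.5 (evenly split), matching the slide's claim that higher gini-index means more discriminative.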

Analysis

- Why do we care? DYCOS is essentially using Monte-Carlo sampling to sample various paths from each unlabeled node. Advantage: a fast approach. Disadvantage: some loss of accuracy. Can we present analysis of how accurate DYCOS sampling is?
- Probabilistic bound, bi-class case: two classes C_1 and C_2 with expected visit probabilities E[Pr[C_1]] = f_1 and E[Pr[C_2]] = f_2, where b = f_1 - f_2 ≥ 0. Then Pr[mis-classification] ≤ exp{-l·b²/2}, where l is the number of sampled walks.
- Probabilistic bound, multi-class case: k classes {C_1, C_2, ..., C_k}; for a b-accurate sampling process, Pr[b-accurate] ≥ 1 - (k-1)·exp{-l·b²/2}.
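The bi-class bound can be sanity-checked with a quick simulation. This uses a simplified model where each of the l samples independently votes class 1 with probability f_1; the real walk visits are not i.i.d., so this only illustrates the shape of the bound, not the paper's proof.

```python
import math
import random

def empirical_misclass(f1, l, trials=2000, seed=0):
    """Fraction of trials in which class 2 ties or beats class 1 across
    l independent votes, each favoring class 1 with probability f1."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        c1 = sum(rng.random() < f1 for _ in range(l))
        errors += c1 <= l - c1          # class 2 ties or wins the vote
    return errors / trials

f1, l = 0.6, 100                        # so f2 = 0.4 in this two-class model
b = f1 - (1 - f1)                       # b = f1 - f2 = 0.2
bound = math.exp(-l * b * b / 2)        # the slide's bound: exp{-l*b^2/2}
print(empirical_misclass(f1, l), "<=", round(bound, 3))
```

With l = 100 and b = 0.2 the bound evaluates to exp(-2) ≈ 0.135, and the simulated error rate comes in well under it, consistent with the bound being a (loose) Hoeffding-style guarantee.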

Outline

- Related Works
- DYCOS: DYnamic Classification algorithm with cOntent and Structure
  - Semi-bipartite content-structure transformation
  - Classification using a series of text- and link-based random walks
  - Accuracy analysis
- Experiments (NetKit-SRL)
- Conclusion

Experimental Results

- Data sets:
  - CORA: a set of research papers and the citation relations among them. Each node is a paper and each edge is a citation relation. A total of 12,313 English words are extracted from the paper titles. We segment the data into 10 synthetic time periods.
  - DBLP: a set of authors and their collaborations. Each node is an author and each edge is a collaboration. A total of 194 English words in the domain of computer science are used. We segment the data into 36 annual graphs, starting from year 1975.

Experimental Results

- NetKit-SRL toolkit: an open-source, publicly available toolkit for statistical relational learning in networked data (Macskassy and Provost, 2007); it provides instantiations of previous relational and collective classification algorithms.
  - Configuration: local classifier = domain-specific class prior; relational classifier = network-only multinomial Bayes classifier; collective inference = relaxation labeling.
- DYCOS parameters:
  1) the number of most discriminative words, m;
  2) the size constraint of the inverted list for each keyword, a;
  3) the number of top content-hop neighbors, q;
  4) the number of random walks, l;
  5) the length of each random walk, h;
  6) the structure parameter, p_s.
- The results demonstrate that DYCOS improves the classification accuracy over NetKit by 7.18% to 17.44%, while reducing the runtime to only 14.60% to 18.95% of that of NetKit.

Experimental Results: DYCOS vs. NetKit on CORA

[Figures: classification accuracy comparison and classification time comparison.]

Experimental Results: Parameter Sensitivity of DYCOS

[Figures, CORA and DBLP data: sensitivity to m, l, and h (with a = 30, p_s = 70%); sensitivity to a, m, and p_s (with l = 3, h = 5).]

Experimental Results: Dynamic Updating Time

[Tables: update time in seconds per time period, for DBLP and CORA.]

Outline

- Related Works
- DYCOS: DYnamic Classification algorithm with cOntent and Structure
  - Semi-bipartite content-structure transformation
  - Classification using a series of text- and link-based random walks
  - Accuracy analysis
- Experiments (NetKit-SRL)
- Conclusion

Conclusion

- We propose an efficient, dynamic, and scalable method for node classification in dynamic networks.
- We provide analysis of how accurate the proposed method will be in practice.
- We present experimental results on real data sets, and show that our algorithms are more effective and efficient than competing algorithms.


Challenges (details)

- Information networks are very large: scalable and efficient.
- Many such networks are dynamic: updatable in real time; self-adaptable and robust.
- Such networks are often noisy: intelligent and selective.
- Heterogeneous correlations in such networks:
  - the correlations between the label of o and the content of o;
  - the correlations between the label of o and the contents and labels of objects in the neighborhood of o;
  - the correlations between the label of o and the unobserved labels of objects in the neighborhood of o.

Analysis (details)

- Lemma: consider two classes with expected visit probabilities f_1 and f_2 respectively, such that b = f_1 - f_2 > 0. Then the probability that the class which is visited the most during the sampled random-walk process is reversed to class 2 is at most exp{-l·b²/2}.
- Definition: consider the node classification problem with a total of k classes. We define the sampling process to be b-accurate if none of the classes whose expected visit probability is at least b less than that of the class with the largest expected visit probability turns out to have the largest sampled visit probability.
- Theorem: the probability that the sampling process results in a b-accurate reported majority class is at least 1 - (k-1)·exp{-l·b²/2}.

Experimental Results: DYCOS vs. NetKit on DBLP

[Figures: classification accuracy comparison and classification time comparison.]

Experimental Results: Additional Parameter Sensitivity Plots

[Figures: sensitivity to a, l, and h; to m, l, and h; to m, a, and p_s; to a, m, and p_s.]