Distributed Classification in Peer-to-Peer Networks Ping Luo, Hui Xiong, Kevin Lü, Zhongzhi Shi Institute of Computing Technology, Chinese Academy of Sciences.


Presentation by: Satya Bulusu

Overview (03/27/2008)
– Introduction
– Building Local Classifiers
– Distributed Plurality Voting
– Experimental Results
– Related Works
– Summary

Research Motivation
– Widespread use of P2P networks and sensor networks
– The data to be analyzed are distributed over the nodes of these large-scale, dynamic networks
– Traditional distributed data mining algorithms must be extended to fit this new environment
Motivating Examples
– P2P anti-spam networks
– Automatic organization of web documents in P2P environments
A distributed classification algorithm is critical in these applications.

Research Motivation (contd.)
New Challenges
– Peers are highly decentralized; there is no notion of clients and servers
– Networks include hundreds or thousands of nodes, making global synchronization impossible
– Topology changes frequently as peers fail and recover
Algorithm Requirements
– Scalability, decentralized in-network processing
– Communication efficiency, local synchronism
– Fault tolerance

Problem Formulation
Given:
– A connected topology graph
– Each peer owns its local training data for classification
– Each peer is informed of changes in its local neighborhood in real time
Find:
– A classification paradigm for this setting, including how to train and use a global classifier
Objective:
– Scalability, communication efficiency, decentralized in-network processing, fault tolerance
Constraints:
– Each peer can communicate only with its immediate neighbors
– The network topology changes dynamically

Contributions of This Paper
An algorithm that builds an ensemble classifier for distributed classification in P2P networks by plurality voting over all the local classifiers
– Adapts the training paradigm of pasting bites for building local classifiers
– An algorithm for (restrictive) Distributed Plurality Voting (DPV) to combine the decisions of local classifiers, with proofs of correctness and optimality
Extensive experimental evaluation
– Communication overhead and convergence time of DPV
– Accuracy comparison with centralized classification


Building Local Classifiers
Pasting Bites, by Breiman [JML'99]
– Generate small bites of the data by importance sampling based on the out-of-bag error of the classifiers built so far
– Stopping criterion: the difference in error between two successive iterations falls below a threshold
– All the resulting classifiers vote uniformly
The more data a node holds, the more classifiers it generates, and the more votes it owns.
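The pasting-bites loop above can be sketched as follows. This is a simplified illustration, not Breiman's exact procedure: it uses training-set error instead of out-of-bag error, and a fixed 2x weight for misclassified examples; `train_fn`, `bite_size`, and `eps` are names chosen for this sketch.

```python
import random
from collections import Counter

def ensemble_predict(classifiers, x):
    # every classifier gets exactly one uniform vote
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

def paste_bites(data, train_fn, bite_size=50, eps=1e-3, max_iters=20):
    # data: list of (features, label) pairs held by one peer
    # train_fn: fits a classifier on a small bite, returns a callable
    classifiers, prev_err = [], 1.0
    for _ in range(max_iters):
        if not classifiers:
            weights = [1.0] * len(data)          # first bite: uniform
        else:
            # importance sampling: examples the current ensemble gets
            # wrong are (here, arbitrarily) twice as likely to enter
            # the next bite
            weights = [2.0 if ensemble_predict(classifiers, x) != y else 1.0
                       for x, y in data]
        bite = random.choices(data, weights=weights,
                              k=min(bite_size, len(data)))
        classifiers.append(train_fn(bite))
        err = sum(ensemble_predict(classifiers, x) != y
                  for x, y in data) / len(data)
        if abs(prev_err - err) < eps:            # error stopped improving
            break
        prev_err = err
    return classifiers
```

A node with more data sustains more iterations before the error flattens, which is how it ends up with more classifiers, and hence more votes.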


Problem Formulation of DPV
Given:
– A group of peers in a graph would like to agree on one of d options
– Each peer p conveys its preference by initializing a voting vector ω^p = (ω^p_1, …, ω^p_d), where ω^p_i is the number of votes for the i-th option
– Each peer is informed of changes in its local neighborhood in real time
Find:
– The option with the largest number of votes over all peers: argmax_i Σ_p ω^p_i
Objective:
– Scalability, communication efficiency, decentralized in-network processing, fault tolerance
Constraints:
– Each peer can communicate only with its immediate neighbors
– The network topology changes dynamically
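The "Find" objective above has a simple centralized reference semantics, which is what every node of the protocol must converge to. A minimal sketch (purely illustrative; the protocol itself computes this without any central server):

```python
def global_plurality(voting_vectors):
    # Sum the peers' voting vectors option-wise, then return the
    # index of the maximally voted option: argmax_i sum_p w[p][i].
    totals = [sum(column) for column in zip(*voting_vectors)]
    return max(range(len(totals)), key=totals.__getitem__)
```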

An Example of DPV
[Figure omitted: peers with local voting vectors; summed over all peers, the third option receives the most votes.] The third option is the answer.

Comparison Between DPV and Distributed Majority Voting (DMV, by Wolff et al. [TSMC'04])
DMV Given:
– A group of peers in a graph
– Each peer conveys its preference by initializing a 2-tuple (c, n), where c is the number of votes for a certain option and n is the total number of votes on this peer
– A majority ratio λ
DMV Find:
– Check whether the voting proportion of the specified option is above λ: Σ_p c^p / Σ_p n^p > λ
DMV Converted to DPV:
– Replace the 2-tuple on each peer with a voting vector
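The conversion in the last bullet can be sketched like this (the weighting scheme is one possible encoding, chosen for this sketch): a peer's 2-tuple (c, n) with majority ratio lam becomes a 2-option voting vector, so that option 0 wins the global plurality exactly when Σc / Σn > lam.

```python
def dmv_as_dpv_vector(c, n, lam):
    # option 0 carries (1 - lam) * c votes, option 1 carries
    # lam * (n - c); summed over peers, option 0 is maximal
    # iff sum(c) > lam * sum(n), i.e. the DMV predicate holds
    return ((1 - lam) * c, lam * (n - c))

def dmv_decision(tuples, lam):
    # add up the per-peer vectors and compare the two options
    vectors = [dmv_as_dpv_vector(c, n, lam) for c, n in tuples]
    totals = [sum(col) for col in zip(*vectors)]
    return totals[0] > totals[1]
```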

Comparison Between DPV and DMV (contd.)
DPV vs. DMV
– DPV is a multi-valued function, while DMV is a binary predicate
– DMV can be solved by converting it to DPV
– However, DMV can only solve 2-option DPV problems; for a d-option DPV problem, DMV must perform repeated pairwise comparisons among the d options (Multiple Choice Voting [TSMC'04])
– DPV finds the maximally supported option directly, and thus saves considerable communication overhead and convergence time
DPV is the general form of DMV.

Challenges for DPV
– No central server to add up the voting vectors; communication happens only between immediate neighbors
– Dynamic change of not only the network topology but also the local voting vectors
– Must support not only one-shot queries but also continuous monitoring of the current voting result under the latest network status

DPV Protocol Overview
Assumptions:
– A mechanism maintains an undirected spanning tree over the dynamic P2P network; the protocol runs on this tree (duplicate insensitive)
– A node is informed of changes in the status of adjacent nodes
Protocol Overview
– Each node performs the same algorithm independently
– The protocol specifies how nodes initialize and how they react in each situation: a message is received, a neighboring node detaches or joins, or the local voting vector changes
– When the node's status changes in one of these situations, the node notifies its other neighbors only if the condition for sending messages is satisfied
– This guarantees that every node in the network converges to the correct plurality
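The per-node state implied by the overview can be sketched as below. Names (`local`, `recv`, `knowledge`, `report_to`) are illustrative, not the paper's notation; the point is that on a spanning tree, subtracting what a neighbor itself reported keeps the sums duplicate-insensitive.

```python
class DPVNode:
    # For each spanning-tree neighbor the node remembers the last
    # voting vector that neighbor reported to it.
    def __init__(self, local_vector):
        self.local = list(local_vector)  # this node's own votes
        self.recv = {}                   # neighbor -> last vector received

    def knowledge(self):
        # all votes this node currently knows about: its own votes
        # plus the latest report from every neighbor
        k = list(self.local)
        for vec in self.recv.values():
            k = [a + b for a, b in zip(k, vec)]
        return k

    def report_to(self, j):
        # what it would tell neighbor j: everything it knows minus
        # what j itself contributed, so nothing is counted twice
        k = self.knowledge()
        from_j = self.recv.get(j, [0] * len(k))
        return [a - b for a, b in zip(k, from_j)]
```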

The Condition for Sending Messages (contd.): Message Sent
Old: (5,2,1)+(2,0,0)=(7,2,1); margins 7−2=5 and 7−1=6
New: (8,6,1)+(2,0,0)=(10,6,1); margins 10−6=4 and 10−1=9
Since 4 < 5, the difference between the votes of the maximally voted option and some other option has decreased → Message Sent

The Condition for Sending Messages: No Message Sent
Old: (5,2,1)+(2,0,0)=(7,2,1); margins 7−2=5 and 7−1=6
New: (8,4,1)+(2,0,0)=(10,4,1); margins 10−4=6 and 10−1=9
Since 6 > 5 and 9 > 6, the differences between the votes of the maximally voted option and all other options do not decrease → No Message Sent
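The two slide examples follow one rule, which can be sketched as below (a simplification assuming the top option does not change between the old and new reports; `incoming` is the neighbor's own last message, which gets added back before comparing margins):

```python
def margins(report, incoming):
    # add the neighbor's last message, then compute the gap between
    # the top option's votes and every other option's votes
    total = [a + b for a, b in zip(report, incoming)]
    top = max(range(len(total)), key=total.__getitem__)
    return [total[top] - v for i, v in enumerate(total) if i != top]

def should_send(old_report, new_report, incoming):
    # send a fresh report only if some margin shrank; if every margin
    # stayed the same or grew, the neighbor's view is still safe
    old = margins(old_report, incoming)
    new = margins(new_report, incoming)
    return any(n < o for n, o in zip(new, old))
```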

The Correctness of the DPV Protocol
– All the nodes converge to the same result.
– The difference between the actual votes of the maximally voted option and any other option is never smaller than the difference implied by what the protocol has sent.
– Therefore, all the nodes converge to the right result.

The Optimality of the DPV Protocol
– C1 is more restrictive than C2 iff, for any input case, whenever C1 is true, C2 is true.
– C1 is strictly more restrictive than C2 iff C1 is more restrictive than C2 and there exists at least one input case for which C1 is false and C2 is true.
– The protocol's condition for sending messages is the most restrictive condition that preserves the correctness of the DPV protocol: it is the condition that is hardest to satisfy. In this sense, it guarantees optimality in communication overhead.

The Extension of the DPV Protocol
Restrictive Distributed Plurality Voting:
– Can be used by a classification ensemble in a restrictive manner, leaving out some uncertain instances
– Outputs the maximally voted option only if its proportion of all the votes is above a user-specified threshold
– The new condition for sending messages follows the same spirit as the original one
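The restrictive output rule can be sketched in a few lines (`tau` is this sketch's name for the user-specified threshold; `None` stands for abstaining on an uncertain instance):

```python
def restrictive_plurality(totals, tau):
    # report the top option only when its share of all votes exceeds
    # tau; otherwise abstain, so the ensemble leaves the instance
    # unclassified rather than guessing
    top = max(range(len(totals)), key=totals.__getitem__)
    return top if totals[top] > tau * sum(totals) else None
```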


Accuracy of P2P Classification
Data: covtype (581,012 instances × 54 attributes, 7 classes) from the UCI repository, distributed onto 500 nodes

The Performance of the DPV Protocol
Experimental Parameters
– Different types of network topology: power-law graph, random graph, grid
– Number of nodes: 500, 1000, 2000, 4000, 8000
– 7-option DPV problems
Experimental Metrics
– The average communication overhead per node
– The convergence time of the protocol for a one-shot query

The Performance of the DPV Protocol (contd.)
DPV0 vs. RANK (Multiple Choice Voting)
– 500 nodes
– Averaged over 2000 instances of 7-option plurality voting problems
– a and b are the largest and second-largest options, respectively

The Performance of the DPV Protocol (contd.)
The Scalability of DPV0
– Different numbers of nodes vs. the communication overhead of each node

The Performance of the DPV Protocol (contd.)
The Local Optimality of DPV0
– Communication overhead and convergence time under different conditions for sending messages


Related Work - Ensemble Classifiers
Model combination: (weighted) voting, meta-learning
For centralized data
– Applying different learning algorithms with heterogeneous models
– Applying a single learning algorithm to different versions of the data
  – Bagging: random sampling with replacement
  – Boosting: re-weighting of the misclassified training examples
  – Pasting Bites: generating small bites of the data by importance sampling based on the quality of the classifiers built so far
For distributed data
– Distributed boosting by Lazarevic et al. [SIGKDD'01]
– A distributed approach to pasting small bites by Chawla et al. [JMLR'04], which uniformly votes hundreds or thousands of classifiers built on all the distributed data sites

Related Work - P2P Data Mining
Primitive aggregates
– Average
– Count, Sum
– Max, Min
– Distributed Majority Voting by Wolff et al. [TSMC'04]
P2P data mining algorithms
– P2P association rule mining by Wolff et al. [TSMC'04]
– P2P k-means clustering by Datta et al. [SDM'06]
– P2P L2 threshold monitoring by Wolff et al. [SDM'06]
– Outlier detection in wireless sensor networks by Branch et al. [ICDCS'06]
– A classification framework in P2P networks by Siersdorfer et al. [ECIR'06]
  – Limitations: propagation of local classifiers, experiments on only 16 peers, focus solely on accuracy, no treatment of the dynamism of P2P networks


Summary
– Proposed an ensemble paradigm for distributed classification in P2P networks
– Formalized a generalized Distributed Plurality Voting (DPV) protocol for P2P networks
Properties of DPV0
– Supports both one-shot queries and continuous monitoring
– Theoretical local optimality in terms of communication overhead
– Outperforms alternative approaches
– Scales up to large networks

Q & A. Acknowledgements.