Web Data Mining (Mining di Dati Web). Web Community Mining and Web Log Mining: Commodity Cluster Based Execution. Romeo Zitarosa.


Overview
- Introduction
- Web Community Mining
- Web log mining on MIS
- Parallel Data Mining on PC Cluster
- Performance Evaluation
- Conclusion

Introduction
Two applications of web mining are proposed:
1) Extracting web communities
2) Understanding the behaviour of mobile Internet users (usage mining)

Web Community Mining
Definition: a web community is a collection of web pages created by individuals or associations that share a common interest in a specific topic.

Proposed technique
- Starts from a set of seed pages
- Based on RPA
- Creates a community chart

Authorities and Hubs
- Authority: a page with good content on a topic, linked to by many good hub pages.
- Hub: a page with a list of hyperlinks to valuable pages on a topic, i.e. one that points to good authorities.
- Community core = authorities + hubs

Web Community Mining
Algorithm:
1. Start from a seed set.
2. Apply RPA to each seed: build a web subgraph and extract its hubs and authorities (using HITS).
3. Investigate how each seed derives other seeds as related pages.
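
Step 2 relies on HITS to score hubs and authorities in the subgraph built around a seed. The following is a minimal sketch of the standard HITS iteration, not the presenter's implementation; the function name `hits`, the edge-list input and the fixed iteration count are assumptions made for illustration.

```python
from collections import defaultdict


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    edges: iterable of (source_page, target_page) hyperlinks (hypothetical input format).
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {n: sum(hub[m] for m in in_links[n]) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```

For example, with the links `h1 -> a`, `h2 -> a` and `h1 -> b`, page `a` receives the highest authority score and `h1` the highest hub score.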

Example
1. Suppose s derives t as a related page and vice versa.
   - “s” and “t” are pointed to by similar sets of hubs.
2. Suppose s derives t as a related page but t does not derive s.
   - “t” is pointed to by many different hubs, so “t” derives a different set of related pages.

Observation
- In this way we define a symmetric derivation relationship (s.d.r.) for identifying communities.
- Definition (community): a set of pages strongly connected by the s.d.r.
- Two communities are related if a member of one community derives a member of the other.
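
One simple way to read the definitions above is to treat communities as groups of pages linked by mutual derivation. The sketch below assumes exactly that (connected components of the symmetric derivation relationship); the slides do not spell out what "strongly connected" requires, so this is an illustrative simplification. The names `find_communities`, `related_communities` and the `derives` dictionary are hypothetical.

```python
def find_communities(derives):
    """Group pages connected by symmetric derivation.

    derives: dict mapping each page to the set of pages it derives as related pages.
    Returns a sorted list of sorted page lists (one list per community).
    """
    pages = set(derives) | {t for targets in derives.values() for t in targets}
    parent = {p: p for p in pages}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for s, targets in derives.items():
        for t in targets:
            if s in derives.get(t, set()):  # s derives t AND t derives s
                parent[find(s)] = find(t)

    groups = {}
    for p in pages:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


def related_communities(communities, derives):
    """Return (i, j) index pairs where a member of community i derives a member of community j."""
    index = {p: i for i, c in enumerate(communities) for p in c}
    return {
        (index[s], index[t])
        for s, targets in derives.items()
        for t in targets
        if index[s] != index[t]
    }
```

With `derives = {"s": {"t"}, "t": {"s", "u"}, "u": set()}`, pages `s` and `t` form one community, `u` forms another, and the first community is related to the second because `t` derives `u`.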

Web Community Chart
- Definition: a graph that consists of communities as nodes and weighted edges between nodes.
- The weight of an edge represents the relevance of the communities it connects.
- We need a tool to browse communities.

Web Community Chart (2)
- Labels are assigned manually.
- Box = list of URLs sorted by connectivity score.
- Definition (connectivity score): the number of derivation relationships from the node to other nodes of the community.
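
The connectivity score can be computed directly from the definition. This is a small sketch under the assumption that derivation relationships are stored as a dictionary from each page to the set of pages it derives; the function names and the alphabetical tie-break are illustrative choices, not taken from the slides.

```python
def connectivity_scores(community, derives):
    """Count, for each member, the derivation relationships to other members."""
    members = set(community)
    return {
        page: len((derives.get(page, set()) & members) - {page})
        for page in members
    }


def ranked_urls(community, derives):
    """List the community's URLs by descending connectivity score (ties: alphabetical)."""
    scores = connectivity_scores(community, derives)
    return sorted(scores, key=lambda page: (-scores[page], page))
```

Derivations that point outside the community are ignored, so a page that derives many unrelated pages does not rise to the top of the box.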

Example (figure only)

Mobile Info Search (MIS)
- NTT Laboratories
- Goal: provide location-aware information from the Internet by collecting, structuring, filtering and organizing it.

kokono
There is a database-type resource between the user and the information sources (online maps, yellow pages, etc.).

MIS Functionalities
- User Location Acquisition
  - GPS, PHS, postal number
- Location-Oriented Robot-Based Search (kokono)
  - searches for documents close to a location
  - displays documents in order of the distance between the location written in the document and the user's position
- Location-Oriented Meta Search
  - backbone database accessed by CGI programs
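
Ordering documents by distance from the user can be illustrated as follows. This sketch assumes that each document has already been geocoded to a latitude and longitude (the `lat` and `lon` keys are hypothetical) and uses the great-circle (haversine) distance; the slides do not say which distance measure kokono uses.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def rank_by_distance(user_position, documents):
    """Sort documents by distance between their written location and the user's position."""
    user_lat, user_lon = user_position
    return sorted(
        documents,
        key=lambda doc: haversine_km(user_lat, user_lon, doc["lat"], doc["lon"]),
    )
```

A user in central Tokyo would, for instance, see a document located in Yokohama before one located in Osaka.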

Association Rule Mining
- Support, confidence
- Hierarchy => taxonomy
- The hierarchy makes it possible to find not only rules specific to a location but also rules for a wider area that covers that location.
- Identify access patterns of MIS users.
- Prefetch information.
- Reduce access time.
- Spatial information gives valuable information to mobile users.
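
Support and confidence over a location taxonomy can be sketched as below. The approach shown (adding every ancestor of an item to the transaction before counting) is the usual way to mine generalized rules; the taxonomy, place names and function names are invented for illustration, and the taxonomy is assumed to be a tree given as a child-to-parent mapping.

```python
def extend_with_ancestors(transaction, parent):
    """Add every taxonomy ancestor of each item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:  # walk up until the root is reached
            item = parent[item]
            extended.add(item)
    return extended


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent); antecedent must occur at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)
```

Usage, with made-up access logs:

```python
parent = {"Shibuya": "Tokyo", "Shinjuku": "Tokyo"}
logs = [{"Shibuya", "restaurant"}, {"Shinjuku", "restaurant"},
        {"Shinjuku", "hotel"}, {"Osaka", "restaurant"}]
logs = [extend_with_ancestors(t, parent) for t in logs]
support({"Shibuya", "restaurant"}, logs)  # rule specific to one location
support({"Tokyo", "restaurant"}, logs)    # rule for the wider area
```

The second itemset has higher support than the first because it aggregates accesses from every location under "Tokyo".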

Sequential Rule Mining
- Sequential patterns
- Derive how different services are used together.
Example: define the plan after checking the weather:
Submit_weather = Weather Forecast -> submit_shop = Shop Info && shop_web = townpage -> Submit_kokono = KOKONOSearch -> Submit_map = MAP
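
The support of a sequential pattern such as the one above can be measured by checking in how many user sessions the steps occur in order. The sketch below simplifies each step to a single service name (the slide's steps also carry attribute values) and allows other requests between steps; the session data and function names are assumptions for illustration.

```python
def contains_sequence(session, pattern):
    """True if the pattern's steps appear in the session in order (gaps allowed)."""
    remaining = iter(session)
    # `step in remaining` advances the iterator, so later steps must come later.
    return all(step in remaining for step in pattern)


def sequence_support(pattern, sessions):
    """Fraction of sessions that contain the pattern as an ordered subsequence."""
    return sum(contains_sequence(s, pattern) for s in sessions) / len(sessions)
```

A session `["weather", "shop", "kokono", "map"]` contains the pattern `["weather", "map"]`, while `["map", "weather"]` does not, because order matters.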

Parallel DM and PC Cluster
- Parallel Apriori
  - nodes keep all candidate itemsets
  - each node scans the dataset independently
  - nodes communicate only at the end of the phase
Problem: too much memory is used.
Partial solution: Hash Partitioned Apriori (HPA)
  - candidates are partitioned using a hash function
  - each node builds candidate itemsets
  - a lot of disk I/O when support is small
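
The idea behind hash partitioning can be illustrated with a single-process simulation: each itemset is owned by exactly one node, chosen by a hash of the itemset, so no node has to hold every candidate. This is a simplified sketch, not the HPA implementation; it counts every k-itemset found in the transactions and omits candidate pruning and real inter-node communication. The hash choice (`zlib.crc32`) and function names are assumptions.

```python
import zlib
from itertools import combinations


def owner(itemset, n_nodes):
    """Pick the node responsible for an itemset with a deterministic hash."""
    key = ",".join(sorted(itemset)).encode()
    return zlib.crc32(key) % n_nodes


def hpa_count(transactions, k, n_nodes):
    """Simulate one counting pass: route each k-itemset to its owner node."""
    node_counts = [dict() for _ in range(n_nodes)]
    for transaction in transactions:
        for itemset in combinations(sorted(transaction), k):
            counts = node_counts[owner(itemset, n_nodes)]  # "send" to the owner
            counts[itemset] = counts.get(itemset, 0) + 1
    return node_counts
```

Because ownership is a function of the itemset alone, the per-node tables are disjoint and their union gives the global counts.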

Parallel Algorithms for Association Rule Mining
- Non Partitioned Generalized (NPGM)
- Hash Partitioned (HPGM)
  - reduces communication
- Hierarchical HPGM (H-HPGM)
  - candidates whose roots are identical are allocated to the same node
- H-HPGM with Fine Grain Duplicates (H-HPGM-FGD)
  - uses the remaining free space
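
The H-HPGM allocation rule can be sketched by hashing the taxonomy roots of a candidate's items instead of the items themselves, so that candidates with identical roots land on the same node. The taxonomy, the hash function and the names below are illustrative assumptions, and the taxonomy is assumed to be a forest given as a child-to-parent mapping.

```python
import zlib


def root_of(item, parent):
    """Follow child-to-parent links up to the top of the taxonomy."""
    while item in parent:
        item = parent[item]
    return item


def hhpgm_owner(itemset, parent, n_nodes):
    """Allocate a candidate itemset to a node based on the roots of its items."""
    roots = sorted(root_of(item, parent) for item in itemset)
    return zlib.crc32(",".join(roots).encode()) % n_nodes
```

With a taxonomy in which "Shibuya" and "Shinjuku" fall under "Tokyo", and "ramen" and "sushi" under "restaurant", the candidates {Shibuya, ramen}, {Shinjuku, sushi} and {Tokyo, restaurant} all map to the same node.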

Performance evaluation
Observation: execution time increases as the support becomes smaller.

Conclusion
- Real web mining applications need high-performance computing systems.
- A PC cluster, with its scalable performance (and high costs), is a promising platform…