Web Data Mining (Mining di Dati Web). Web Community Mining and Web Log Mining: Commodity Cluster Based Execution. Romeo Zitarosa.


1 Web Community Mining and Web Log Mining: Commodity Cluster Based Execution. Romeo Zitarosa

2 Overview
- Introduction
- Web Community Mining
- Web log mining on MIS
- Parallel data mining on a PC cluster
- Performance evaluation
- Conclusion

3 Introduction
Two applications of web mining are proposed:
1) Extracting web communities
2) Understanding the behaviour of mobile Internet users (usage mining)

4 Web Community Mining
Def. Web community: a collection of web pages created by individuals or associations that have a common interest in a specific topic.

5 Proposed technique
- Starts from a set of seeds
- Based on RPA
- Creates a community chart

6 Authorities and Hubs
- Authority: a page with good content on a topic, linked to by many good hub pages.
- Hub: a page with a list of hyperlinks to valuable pages on a topic, i.e. one that points to good authorities.
- Community core = authorities + hubs

7 Web Community Mining: algorithm
1. Start from a seed set.
2. Apply RPA to each seed: build a web subgraph and extract hubs and authorities (using HITS).
3. Investigate how each seed derives other seeds as related pages.
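The hub and authority extraction in step 2 can be illustrated with a minimal HITS sketch. This is not the presenter's implementation: the graph representation (a dict from page to outgoing links), the function names and the fixed iteration count are all assumptions made for illustration.

```python
from math import sqrt


def normalise(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Plain HITS on a small link graph.

    links: dict mapping a page to the pages it points to (hypothetical format).
    Returns (hub_scores, authority_scores).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        auth = normalise(auth)
        hub = normalise(hub)
    return hub, auth


# Toy subgraph: three hub-like pages pointing at two content pages.
example_links = {"h1": ["a", "b"], "h2": ["a", "b"], "h3": ["a"]}
hub_scores, auth_scores = hits(example_links)
```

In the toy graph, page "a" is pointed to by all three hubs and therefore ends up with the highest authority score.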

8 Example
1. Suppose s derives t as a related page and vice versa: “s” and “t” are pointed to by similar sets of hubs.
2. Suppose s derives t as a related page but t does not derive s: “t” is pointed to by many different hubs, so “t” derives a different set of related pages.

9 Observation
- In this way we define a symmetric derivation relationship (s.d.r.) for identifying communities.
- Def. Community: a set of pages strongly connected by the s.d.r.
- Two communities are related if a member of one community derives a member of the other community.
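A rough sketch of how communities could be read off a derivation relation follows. It simplifies the definition above by treating a community as a connected component of the symmetric derivation relationship; the `derives` dictionary (seed to the set of pages it derives) and the function names are hypothetical.

```python
def symmetric_pairs(derives):
    """Pairs {s, t} where s derives t and t derives s."""
    pairs = set()
    for s, related in derives.items():
        for t in related:
            if s != t and s in derives.get(t, ()):
                pairs.add(frozenset((s, t)))
    return pairs


def communities(derives):
    """Group seeds into connected components of the symmetric relationship.

    Simplifying assumption: 'strongly connected by the s.d.r.' is read here
    as 'reachable through symmetric pairs'.
    """
    neighbours = {s: set() for s in derives}
    for pair in symmetric_pairs(derives):
        s, t = tuple(pair)
        neighbours[s].add(t)
        neighbours[t].add(s)
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over symmetric pairs
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(neighbours[page] - component)
        seen |= component
        result.append(component)
    return result


# "s" and "t" derive each other; "s" derives "u" but "u" does not derive "s".
example_derives = {"s": {"t", "u"}, "t": {"s"}, "u": {"v"}, "v": {"u"}}
example_communities = communities(example_derives)
```

Here "s" and "t" form one community and "u" and "v" another; the one-way derivation from "s" to "u" does not merge them, but it makes the two communities related.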

10 Web Community Chart
- Def. A graph that consists of communities as nodes and weighted edges between nodes.
- The weight represents the relevance of the community.
- We need a tool to browse communities.

11 Web Community Chart (2)
- Labels are assigned manually.
- Box = list of URLs sorted by connectivity score.
- Def. Connectivity score: the number of derivation relationships from a node to the other nodes of its community.
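The connectivity score and the sorted box can be sketched as below. The data layout and function names are hypothetical. The edge weight used in `chart_edges` (a count of derivation relationships between two communities) is only an assumption made for illustration, since the slides do not say how the weight is computed.

```python
def connectivity_score(page, community, derives):
    """Number of derivation relationships from page to other community members."""
    return sum(
        1 for other in derives.get(page, ())
        if other in community and other != page
    )


def sorted_box(community, derives):
    """URLs of one community, highest connectivity score first (ties by name)."""
    return sorted(
        community,
        key=lambda p: (-connectivity_score(p, community, derives), p),
    )


def chart_edges(community_list, derives):
    """Weighted edges between communities.

    ASSUMPTION: the weight is the number of derivations crossing between
    the two communities; the slides only say it represents relevance.
    """
    owner = {p: i for i, c in enumerate(community_list) for p in c}
    weights = {}
    for s, related in derives.items():
        for t in related:
            if s in owner and t in owner and owner[s] != owner[t]:
                key = tuple(sorted((owner[s], owner[t])))
                weights[key] = weights.get(key, 0) + 1
    return weights


example_derives = {
    "a": {"b", "c", "x"}, "b": {"a"}, "c": {"a", "b"},
    "x": {"y"}, "y": {"x"},
}
example_box = sorted_box({"a", "b", "c"}, example_derives)
```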

12 Example

13 Mobile Info Search (MIS)
- NTT laboratories
- Goal: provide location-aware information from the Internet by collecting, structuring, filtering and organizing it.

14 kokono
A database-type resource sits between the user and the information sources (online maps, yellow pages, etc.).

15 MIS Functionalities
User Location Acquisition
- GPS, PHS, postal number
Location Oriented Robot-Based Search (kokono)
- searches documents close to a location
- displays documents in order of the distance between the location written in the document and the user's position
Location Oriented Meta Search
- backbone database accessed by CGI programs

16 Association Rule Mining
- Support, confidence
- Hierarchy => taxonomy
- A hierarchy allows finding not only rules specific to a location but also rules for the wider area that covers that location.
- Identify access patterns of MIS users.
- Prefetch information.
- Reduce access time.
- Spatial information gives valuable information to mobile users.
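As a minimal sketch of support and confidence over a location hierarchy: each transaction is extended with the ancestors of its items, so rules can be found for a wider area as well as for a specific location. The place names, the `parent` mapping and the log records are invented for illustration, and the hierarchy is assumed to be a tree (no cycles).

```python
def extend_with_ancestors(transaction, parent):
    """Add every ancestor in the taxonomy to the transaction."""
    items = set(transaction)
    for item in transaction:
        while item in parent:  # walk up to the root
            item = parent[item]
            items.add(item)
    return items


def support(itemset, transactions):
    """Fraction of transactions containing every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical taxonomy and access log (not taken from the MIS data).
parent = {"Shibuya": "Tokyo", "Shinjuku": "Tokyo"}
log = [
    {"Shibuya", "restaurant"},
    {"Shinjuku", "restaurant"},
    {"Shibuya", "weather"},
    {"Yokohama", "restaurant"},
]
extended = [extend_with_ancestors(t, parent) for t in log]
```

With this toy log, the rule for the specific location {"Shibuya", "restaurant"} has support 0.25, while the generalized rule for {"Tokyo", "restaurant"} reaches 0.5.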

17 Sequential Rule Mining
- Sequential patterns
- Derive how different services are used together.
Example: defining a plan after checking the weather:
Submit_weather = Weather Forecast -> submit_shop = Shop Info && shop_web = townpage -> Submit_kokono = KOKONOSearch -> Submit_map = MAP
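The idea of a sequential pattern can be sketched as an ordered-subsequence check over user sessions. This is a simplification: each step is a single service name, whereas the example above also combines conditions with &&. The session data and function names are hypothetical.

```python
def contains_in_order(session, pattern):
    """True if the steps of pattern occur in session in the same order.

    Other requests may occur in between (subsequence, not substring).
    """
    remaining = iter(session)
    # 'step in remaining' consumes the iterator up to the first match,
    # so later steps can only match later requests.
    return all(step in remaining for step in pattern)


def sequence_support(pattern, sessions):
    """Fraction of sessions that contain the pattern in order."""
    hits = sum(1 for s in sessions if contains_in_order(s, pattern))
    return hits / len(sessions)


# Hypothetical sessions, each an ordered list of requested services.
sessions = [
    ["weather", "shop", "kokono", "map"],
    ["weather", "map"],
    ["shop", "weather"],
]
weather_then_map = sequence_support(["weather", "map"], sessions)
```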

18 Parallel DM and PC Cluster
Parallel Apriori
- every node keeps all candidate itemsets
- each node scans the dataset independently
- nodes communicate only at the end of the phase
Problem: too much memory is used.
Solution (partial): Hash Partitioned Apriori (HPA)
- candidates are partitioned using a hash function
- each node builds candidate itemsets
- a lot of disk I/O when the support is small
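The hash partitioning in HPA can be simulated in a single process as follows. This is a sketch under several assumptions: message passing is replaced by direct dictionary updates, candidate pruning is omitted (every k-itemset of a transaction is counted), and the CRC32-based `owner` function is just one possible hash, chosen because it is deterministic across runs.

```python
from itertools import combinations
from zlib import crc32


def owner(itemset, n_nodes):
    """Node responsible for counting this itemset (hypothetical hash choice)."""
    key = ",".join(sorted(itemset)).encode()
    return crc32(key) % n_nodes


def hpa_count(partitions, k, n_nodes):
    """Simulate one counting pass of hash partitioned Apriori.

    partitions[i] holds the transactions stored on node i. Each node builds
    k-itemsets from its local transactions and 'sends' each one to the node
    that owns it, so no node has to hold every candidate.
    """
    counts = [dict() for _ in range(n_nodes)]
    for local_transactions in partitions:
        for transaction in local_transactions:
            for itemset in combinations(sorted(transaction), k):
                dest = owner(itemset, n_nodes)  # stands in for a network send
                counts[dest][itemset] = counts[dest].get(itemset, 0) + 1
    return counts


partitions = [
    [{"a", "b"}, {"a", "b", "c"}],  # transactions on node 0
    [{"a", "c"}, {"a", "b"}],       # transactions on node 1
]
per_node_counts = hpa_count(partitions, k=2, n_nodes=2)
```

Each itemset is counted on exactly one node, which is what keeps the per-node memory footprint below that of the non-partitioned scheme.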

19 Parallel Algorithms for Association Rule Mining
Non Partitioned Generalized (NPGM)
Hash Partitioned (HPGM)
- reduces communication
Hierarchical HPGM (H-HPGM)
- candidates whose roots are identical are allocated to the same node
H-HPGM with Fine Grain Duplicates (H-HPGM-FGD)
- uses the remaining free space
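The H-HPGM allocation rule (candidates with identical roots go to the same node) might be sketched like this. The taxonomy, the function names and the CRC32 hash are assumptions for illustration only.

```python
from zlib import crc32


def root(item, parent):
    """Follow parent links up to the top of the hierarchy."""
    while item in parent:
        item = parent[item]
    return item


def h_hpgm_owner(itemset, parent, n_nodes):
    """Choose a node from the roots of the items, not from the items themselves.

    Itemsets whose items share the same roots therefore land on the same
    node, together with all of their generalizations.
    """
    roots = sorted(root(item, parent) for item in itemset)
    return crc32(",".join(roots).encode()) % n_nodes


# Hypothetical taxonomy: places under "Kanto", dishes under "restaurant".
parent = {
    "Shibuya": "Tokyo", "Shinjuku": "Tokyo", "Tokyo": "Kanto",
    "ramen": "restaurant", "sushi": "restaurant",
}
node_for_specific = h_hpgm_owner({"Shibuya", "ramen"}, parent, n_nodes=4)
node_for_general = h_hpgm_owner({"Tokyo", "restaurant"}, parent, n_nodes=4)
```

Both itemsets above have the roots "Kanto" and "restaurant", so they are assigned to the same node.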

20 Performance evaluation
Observation: execution time increases as the support becomes smaller.

21 Conclusion
- Real web mining applications need high-performance computing systems.
- A PC cluster, with its scalable performance (and high costs), is a promising platform…

