Enhanced Content Delivery Action 2: Mine the Web Industrial Day Roma, 10 Giugno 2004.


1 Enhanced Content Delivery Action 2: Mine the Web Industrial Day Roma, 10 Giugno 2004

2 ECD - Industrial Day, Roma 10 Giugno 2004 Action 2 - Partners ICAR-CNR, Cosenza KDD & HPC Labs ISTI-CNR, Pisa Dipartimento di Informatica, Università di Pisa

3 ECD - Industrial Day, Roma 10 Giugno 2004 Action 2 – Mine the Web  The project: four Work Packages (Action Coordinator Dott. Fosca Giannotti, ISTI-CNR)  Work Package 2.1. Web Mining (UNIPI, ISTI, ICAR)  WP Coordinator: Dott. Salvatore Ruggieri, Dip. Informatica  Work Package 2.2. Indexing and compression (UNIPI)  WP Coordinator : Prof. Paolo Ferragina, Dip. Informatica  Work Package 2.3. Managing Terabytes (ISTI, ICAR)  WP Coordinator : Dott. Raffaele Perego, ISTI-CNR  Work Package 2.4. Participatory Search Services (UNIPI)  WP Coordinator : Prof. Maria Simi, Dip. Informatica

4 ECD - Industrial Day, Roma 10 Giugno 2004 Action 2 – Mine the Web  The main goals of the ECD Project, content enhancement and delivery, are here pursued in a complementary way w.r.t. Action 1  The focus is on Delivering Enhanced Web Contents to (Communities of) Users:  Exploiting Web Mining to extract knowledge/models that can be used to enhance efficacy and efficiency of the various phases of the information search process  Designing, validating and providing efficient and scalable solutions for retrieving, storing, and delivering Web contents to users

5 ECD - Industrial Day, Roma 10 Giugno 2004 Motivations  On-line data grows rapidly:  50+M new pages/day (source: IBM)  100+k news items and articles/day (source: IBM)  Databases, digital libraries, etc.  Internet use tracking produces additional interesting data:  Server logs, Web search engine (WSE) logs, network traffic logs  Goldman Sachs estimates (2002): “between 80 and 90 percent of information on the Internet and corporate networks is unstructured”

6 ECD - Industrial Day, Roma 10 Giugno 2004 Motivations  The limits of the current means of access to web contents are becoming clear  Low precision and quality, difficulty of matching users’ subjective relevance  over-abundance of low-quality web material  low coverage and freshness: much relevant information lies in the hidden web, and ranking mechanisms penalize important pages that have only recently appeared  Difficulties in  managing size, complexity, heterogeneity  identifying patterns and trends within huge amounts of unstructured contents  Web Mining plays an important role: it makes it possible to synthesize and extract valuable information and knowledge

7 ECD - Industrial Day, Roma 10 Giugno 2004 Web Mining  Data Mining: the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other repositories  Web Mining: exploiting Data Mining techniques with data coming from the Web  Goal: assist users or site owners in finding something useful/interesting/relevant  User-Centric View (Client-Side)  discovery of documents on a subject  discovery of semantically related documents or document segments  extraction of relevant knowledge about a subject from multiple sources  Owner-Centric View (Server-Side)  increasing contact / conversion efficiency (Web marketing)  targeted promotion of goods, services, products, ads  measuring effectiveness of site content / structure  providing dynamic personalized services or content

8 ECD - Industrial Day, Roma 10 Giugno 2004 Web Mining Taxonomy  Web Mining comprises:  Web Usage Mining  Web Content Mining  Web Structure Mining  Sample web server access log entries (usage data):  [27/May/2004:19:24: ] "GET /images/finger.jpg HTTP/1.1"  [27/May/2004:19:24: ] "GET /images/logokdd.jpg HTTP/1.1"  [27/May/2004:19:24: ] "GET /didattica/BDM2004/TDM_intro pdf HTTP/1.1"  [27/May/2004:19:24: ] "GET /didattica/BDM2004/TDM_intro pdf HTTP/1.1"  [27/May/2004:19:24: ] "GET /didattica/BDM2004/TDM_intro pdf HTTP/1.1"

9 ECD - Industrial Day, Roma 10 Giugno 2004 Web Content Mining  Discover semantics of documents by examining  textual content  linkage structure  domain knowledge and meta-data  user attributes / profiles  Approaches: text mining, document semantic analysis  Discover and extract a common schema to capture relevant semantic information from heterogeneous data sources  Approaches:  Web-based query languages: XML + WebSQL + WebML  Multiple-layered databases; Discovery of concept hierarchies

10 ECD - Industrial Day, Roma 10 Giugno 2004 Web Structure Mining  Discovery and Analysis of Site Structures  Analyzing web site structure (viewed as a directed graph) by comparing the site graph against patterns discovered from site usage / content data  Automatic site construction based on  correlations among pages  domain knowledge / site description  discovery of concept hierarchies among documents  Co-Citation Analysis  Based on the view that the semantic content of a document/site is reflected in  documents/sites to which it refers  documents/sites that refer to it  Application: discovery of authoritative pages
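The idea that a page's importance is reflected in the pages that refer to it can be illustrated with a hub/authority iteration in the style of Kleinberg's HITS algorithm. This is only a sketch: the slides do not say which algorithm the project uses, and the function name `hits` and the toy graph are assumptions made for the illustration.

```python
def hits(links, iterations=50):
    """Hub/authority scores for a directed graph.

    links: dict mapping a page to the list of pages it links to
    (a hypothetical toy representation, not the WOS data model).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page is a good hub if it points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Toy graph: three pages cite "c", one of them also cites "e".
toy_links = {"a": ["c"], "b": ["c"], "d": ["c", "e"], "c": []}
hub_scores, auth_scores = hits(toy_links)
most_authoritative = max(auth_scores, key=auth_scores.get)
```

In the toy graph the page cited by all others ends up with the highest authority score, and the page citing two targets ends up as the strongest hub.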

11 ECD - Industrial Day, Roma 10 Giugno 2004 Web Usage Mining  Discovery of meaningful patterns from data generated by client-server transactions on one or more Web localities  Web localities may involve one or more Web and/or application servers usually belonging to the same organization  Typical Sources of Data:  automatically generated data stored in web server access logs, referrer logs, proxy logs, agent logs, and client-side cookies  user profiles and/or user ratings  meta-data, page attributes, page content, site structure  e-commerce transaction data
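A common first step when mining access logs is to group each client's requests into sessions. The sketch below uses a simple inactivity timeout; the 30-minute threshold, the tuple layout `(client, timestamp, url)`, and the function name `sessionize` are assumptions for illustration and are not taken from the project.

```python
from collections import defaultdict


def sessionize(requests, timeout=1800):
    """Split each client's requests into sessions.

    requests: iterable of (client, timestamp_in_seconds, url) tuples.
    A new session starts when the gap between two consecutive requests
    of the same client exceeds `timeout` seconds.
    Returns a list of (client, [url, ...]) pairs.
    """
    by_client = defaultdict(list)
    for client, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_client[client].append((ts, url))

    sessions = []
    for client, hits in by_client.items():
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:
                # Gap too long: close the current session, open a new one.
                sessions.append((client, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((client, current))
    return sessions


example_requests = [
    ("u1", 0, "/a"),
    ("u1", 60, "/b"),
    ("u1", 4000, "/c"),   # more than 30 minutes after the previous request
    ("u2", 10, "/a"),
]
example_sessions = sessionize(example_requests)
```

Real logs usually need extra cleaning first (robots, embedded images, clients hidden behind a shared proxy), which this sketch leaves out.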

12 ECD - Industrial Day, Roma 10 Giugno 2004 Web Mining Applications  Web Usage Mining  discovering customer preference and behavior  Web personalization / collaborative filtering  adaptive Web sites / improving Web site organization  e-business intelligence, etc.  Web Content Mining  information filtering / knowledge extraction  Web document categorization  discovery of ontologies on the Web, etc.  Web Structure Mining  Finding "Quality" or "authoritative" sites based on linkage and citations (IBM CLEVER project, Google)  Etc.

13 ECD - Industrial Day, Roma 10 Giugno 2004 Some related projects  WebFountain - IBM  WebBase - Stanford DBGroup

14 ECD - Industrial Day, Roma 10 Giugno 2004 WebFountain World-Wide Web, News Forums, Weblogs, etc. Newspapers, Magazines, etc. Customer Electronic Text WebFountain Infrastructure for Advanced Text Analytics Finds patterns, trends and relationships in text Application Examples: Marketing Intelligence Research IBM

15 ECD - Industrial Day, Roma 10 Giugno 2004 WebFountain: an infrastructure for Advanced Text Analytics applications  Cluster capacity: ½ Petabyte  Number of pages in store: 2,000,000,000  Number of pages crawled per day: 25,000,000  Number of pages mined per second: 10,000  Number of 73GB hard drives: 3674  Number of CPUs: 1231  Number of scientists and researchers who have contributed to WebFountain technology: 250  Patents pending: 100  Patents issued: 75  Megabytes/sec traffic coming in from internet: 70  Time to complete query: 5 minutes, 22 seconds  Number of countries contributing to technology: 5

16 ECD - Industrial Day, Roma 10 Giugno 2004 WebFountain: Reputation Tracking

17 ECD - Industrial Day, Roma 10 Giugno 2004 WebBase Stanford DBgroup

18 ECD - Industrial Day, Roma 10 Giugno 2004 WebBase Challenges  Scalability  crawling  archive distribution  index construction  storage  Consistency  freshness  versions  Dissemination  Archiving  “units”  coordination  IP Management  copy access  link access  access control  Hidden Web  Topic-Specific Collection Building

19 ECD - Industrial Day, Roma 10 Giugno 2004 Action 2 – Mine the Web: application scenario  So far, hardly any approach analyzes how a given group of users accesses the Web, with the aim of exploiting usage information to provide the users of this group with enhanced access to web resources  We think that it is possible to learn from the usage data of a group of web users new models and patterns that, in combination with document content and structure, may yield enhanced content access and delivery  better search services, better categorization and document classification services, better question answering services

20 ECD - Industrial Day, Roma 10 Giugno 2004 Action 2 – Mine the Web  Ambitious objective: Exploit the combination of Web data about USAGE, STRUCTURE, CONTENT originated/accessed by a Virtual Organization, to improve the efficacy and efficiency of the knowledge extraction process from the users' point of view  Developing solutions:  Innovative w.r.t. the state of the art  Appropriate for the Web domain

21 ECD - Industrial Day, Roma 10 Giugno 2004 Virtual Organizations Virtual Community Internet

22 ECD - Industrial Day, Roma 10 Giugno 2004 Tracking Virtual Organizations  Tracking the interaction of the virtual community with the Internet allows us to collect several kinds of interesting information  Network Traffic data provide detailed information about:  Usage  Preferred sites, user sessions  Content  Accessed Documents  Structure  From client sessions we can build the usage Web subgraph  By parsing the documents retrieved we can build the corresponding link graph Virtual Community
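The two graphs mentioned above can be sketched in a few lines: a link graph obtained by parsing the retrieved documents, and a traffic (usage) graph obtained from consecutive page visits within sessions. The function names, the dictionary-based inputs, and the choice to count transitions as edge weights are assumptions for illustration; relative-URL resolution is deliberately left out.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def link_graph(pages):
    """pages: dict url -> HTML text. Returns dict url -> set of linked urls."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        graph[url] = set(parser.links)
    return graph


def traffic_graph(sessions):
    """sessions: list of page sequences. Returns Counter of (src, dst) -> visits."""
    edges = Counter()
    for visited in sessions:
        for src, dst in zip(visited, visited[1:]):
            edges[(src, dst)] += 1
    return edges


toy_pages = {"p1": '<a href="p2">x</a> <a href="p3">y</a>', "p2": "no links"}
toy_sessions = [["p1", "p2"], ["p1", "p2", "p3"]]
structure = link_graph(toy_pages)
usage = traffic_graph(toy_sessions)
```

Comparing the two graphs shows which links are actually followed by the community and which transitions happen without a corresponding hyperlink (here, `p2` to `p3`).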

23 ECD - Industrial Day, Roma 10 Giugno 2004 Tracking Virtual Organizations Link graph Traffic graph Link and Traffic graph Virtual Community

24 ECD - Industrial Day, Roma 10 Giugno 2004 Tracking Virtual Organizations Virtual Community  the sequence of pages visited by a user after a query to a WSE gives us precious information about the subjective relevance of pages w.r.t. query topic Query: www consortium

25 ECD - Industrial Day, Roma 10 Giugno 2004 We need an infrastructure: the Web Object Store (WOS)  A Web Data Management System optimized to efficiently handle content, usage, and structure web data  Purpose: Enable (possibly) innovative Web IR and Web Mining research by locally providing a small, but significant, portion of the Web built according to our user-centric view  Manage large collections of  Web pages  Preprocessed Usage data  Structure data  Collected within our virtual community

26 ECD - Industrial Day, Roma 10 Giugno 2004 WOS and related activities  WOS:  Persistent store of objects  Web data management system for web content, structure and usage data  Management of data at many abstraction levels  Fast development of new applications  Easy C++ annotation of new persistent objects  Read and write data in tables  WOS layers:  Clustering/Pattern/Classification Web Mining algorithms  Efficient and scalable access methods: IXE B-trees, full-text indexes, search in compressed data  Data cleaning, preprocessing, filtering  Efficient and scalable storage: IXE persistent objects, compression, distributed architecture  Population: traffic raw data of our community, IXE Crawler, Participatory search  Related activities:  Clustering s  Caching of documents and of query results  Efficient and scalable pattern mining and clustering algorithms  Enhanced compression methods  Clustering/categorizing query result snippets  Clustering XML documents  Etc.

27 ECD - Industrial Day, Roma 10 Giugno 2004 WOS data model  HttpRequest (Usage)  Citation (Structure)  Page (Content)  Higher-level abstractions  PageView  Session/Q-Sessions  User

28 ECD - Industrial Day, Roma 10 Giugno 2004 WOS applications  Some innovative applications are currently pursued within our project:  Characterization, on the basis of usage only or usage + contents + structure, of new important emerging sites, or irrelevant sites (e.g., advertising sites);  crucial to instruct the crawler of the community web repository towards fresh, relevant documents while avoiding unimportant documents  Page ranking based also on usage information, for achieving a more accurate and dynamic measurement of document relevance  Recommendation of similar/related documents and keywords, on the basis of combined usage/content analysis  Caching and clustering of web search results

29 ECD - Industrial Day, Roma 10 Giugno 2004 WOS population: usage data (WP 2.1)  Many-to-many interactions  Inter-site user sessions  Massive data  Millions of HttpRequests per day  ~1 GB/day raw data  We collected long periods of proxy-level IP traffic originating from the SERRA network (domain unipi.it)  The whole University of Pisa

30 ECD - Industrial Day, Roma 10 Giugno 2004 WOS population: content data (WP 2.4)  Methods to gather contents to populate Web Object Store  IXE Crawler  Participatory Search System (main activity this year)  Hidden Web Search

31 ECD - Industrial Day, Roma 10 Giugno 2004 WOS population: content data (WP 2.4)  IXE crawler init get next url get page extract urls initial urls web pages Internet
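The crawler loop in the diagram (init, get next url, get page, extract urls) can be sketched as a breadth-first traversal. This is a minimal single-threaded sketch, not the IXE crawler: the `fetch` callback, the regular expression used to pull out links, and the in-memory toy web are all assumptions chosen so that the example runs without network access.

```python
import re
from collections import deque


def crawl(initial_urls, fetch, max_pages=100):
    """Breadth-first crawl.

    fetch: callable taking a URL and returning its HTML, or None on failure
    (hypothetical interface, so any downloader can be plugged in).
    Returns dict url -> HTML for every page retrieved.
    """
    frontier = deque(initial_urls)          # init: queue of URLs to visit
    seen = set(initial_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()            # get next url
        html = fetch(url)                   # get page
        if html is None:
            continue
        pages[url] = html
        # extract urls (a crude pattern; a real crawler would use an HTML parser)
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Toy in-memory "web": page "d" is linked but cannot be fetched.
toy_web = {
    "a": '<a href="b">b</a> <a href="c">c</a>',
    "b": '<a href="a">a</a> <a href="d">d</a>',
    "c": "",
}
crawled = crawl(["a"], toy_web.get)
```

A production crawler also needs politeness delays, robots.txt handling, URL normalization, and the asynchronous I/O described on the next slide.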

32 ECD - Industrial Day, Roma 10 Giugno 2004 IXE Crawler  Parallel/distributed crawler  High performance through:  asynchronous I/O (500 connections/thread)  asynchronous DNS resolution  keep-alive connections  multi-threading  URL compression  9 Mb/sec transfer rate (7 times that of the nutch.org crawler)

33 ECD - Industrial Day, Roma 10 Giugno 2004 Participatory search: the idea  Participatory search:  each participant builds an index of the local contents and sends it to a central server  the central server implements a community search service collecting and merging the participants' indexes  A model that fits community needs for dedicated search services  A trade-off between a centralized search model (e.g.: Google), and a distributed approach (e.g.: Gnutella, Kazaa)
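The participatory model can be sketched as follows: each participant builds an inverted index over its own documents, and the central server merges the received indexes into one searchable index. The function names, the whitespace tokenizer, and the `(participant, doc_id)` posting format are assumptions for illustration, not the project's actual protocol or index format.

```python
from collections import defaultdict


def build_local_index(docs):
    """Participant side. docs: dict doc_id -> text.

    Returns an inverted index: term -> sorted list of doc ids.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(participant_indexes):
    """Server side. participant_indexes: dict participant -> local index.

    Returns term -> sorted list of (participant, doc_id) postings.
    """
    merged = defaultdict(list)
    for participant, index in participant_indexes.items():
        for term, doc_ids in index.items():
            merged[term].extend((participant, d) for d in doc_ids)
    return {term: sorted(postings) for term, postings in merged.items()}


def search(merged, term):
    """Look up a single term in the merged community index."""
    return merged.get(term.lower(), [])


site_a = build_local_index({"a1": "web mining", "a2": "data mining"})
site_b = build_local_index({"b1": "web search"})
community = merge_indexes({"A": site_a, "B": site_b})
web_hits = search(community, "web")
```

Because the participants decide what goes into their local index, content that no external crawler can reach can still be published to the community search service.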

34 ECD - Industrial Day, Roma 10 Giugno 2004 Participatory Search [Figure: comparison of the Centralized, Participatory and Distributed search models, showing the flow of documents, indexes and search results. Legend: C – Crawler, I – Indexer, S – Search Engine]

35 ECD - Industrial Day, Roma 10 Giugno 2004 Participatory Search: benefits  Participants are in charge of  selecting what to index and to publish  when to publish (no need of coordination with an external crawler)  Control on index update and freshness  Publishing of Hidden Web content

36 ECD - Industrial Day, Roma 10 Giugno 2004 Storage and access methods: compression (WP 2.2)  Our technique takes a poor compressor A and turns it into a compressor A boost with a better performance guarantee  Key components: Burrows-Wheeler Transform, Suffix Tree, and a greedy processing of them  [Diagram: A compresses s into c; the Booster, using A, compresses s into c’]  Qualitatively, we show that  c’ is shorter than c, if s is compressible  Time( A boost ) = Time ( A ), i.e. no slowdown  A is used as a black-box  The better A is, the better A boost is  The more compressible s is, the better A boost is
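Of the key components listed, the Burrows-Wheeler Transform is the easiest to show in isolation. The sketch below is a naive quadratic-time version built from sorted rotations, meant only to show what the transform does to a string; it is not the booster itself, and the `$` end marker (assumed not to occur in the input) is a convention chosen for this example.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (naive version).

    Assumes `sentinel` does not occur in s and sorts before every
    character of s.
    """
    s = s + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(t, sentinel="$"):
    """Invert the transform by repeatedly prepending and re-sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("banana")           # "annb$aa"
restored = inverse_bwt(transformed)   # "banana"
```

The transform tends to group equal characters together (the two `n`s and the `a`s in the example), which is what makes the permuted string easier to compress piece by piece. Practical implementations rely on suffix arrays or suffix trees instead of materializing all rotations.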

37 ECD - Industrial Day, Roma 10 Giugno 2004 Storage and access methods (WP 2.1 and 2.2)  Repository of URLs  Compressed  Prefix and Suffix search within URLs  Search by hostname, path, file-ext, …  Example: select count(*) from … where url LIKE 'http://%.it/%.asp'  Up to two orders of magnitude faster than sequential scan and B-tree  Space occupancy << B-tree
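Prefix and suffix search over a URL collection can be illustrated with two sorted lists, one of the URLs and one of the reversed URLs, each queried by binary search. This is an uncompressed stand-in chosen for clarity, not the compressed repository described above; the class name `UrlIndex` and its methods are hypothetical.

```python
import bisect


class UrlIndex:
    """Sorted-list index supporting prefix and suffix queries on URLs."""

    def __init__(self, urls):
        self.forward = sorted(urls)
        # Suffix queries become prefix queries on the reversed strings.
        self.backward = sorted(u[::-1] for u in urls)

    @staticmethod
    def _range(sorted_keys, prefix):
        lo = bisect.bisect_left(sorted_keys, prefix)
        # Upper bound: prefix followed by the largest code point
        # (assumes no URL contains U+10FFFF).
        hi = bisect.bisect_left(sorted_keys, prefix + "\U0010ffff")
        return sorted_keys[lo:hi]

    def with_prefix(self, prefix):
        return self._range(self.forward, prefix)

    def with_suffix(self, suffix):
        matches = self._range(self.backward, suffix[::-1])
        return sorted(m[::-1] for m in matches)


url_index = UrlIndex([
    "http://a.it/x.asp",
    "http://b.com/y.asp",
    "http://a.it/z.html",
])
on_host = url_index.with_prefix("http://a.it/")
asp_pages = url_index.with_suffix(".asp")
```

A pattern with both a fixed prefix and a fixed suffix can be answered by intersecting the two result sets; patterns with wildcards in the middle, like the one in the SQL example, need additional filtering.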

38 ECD - Industrial Day, Roma 10 Giugno 2004 Storage and access methods: index compression (WP 2.3)  Assigning DocIDs in a clever way could improve the compression factor of traditional variable-[bit/byte] encoding methods by increasing the number of small DGaps.  Clustering property: within each posting list there are dense zones (i.e. a lot of small DGaps).  Our problem consists of enhancing the Clustering Property of posting lists.
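The effect of DocID assignment on compressed size can be seen with a small calculation. The sketch below converts a posting list to DGaps and counts the bytes needed under a standard variable-byte code (7 payload bits per byte); the function names and the two example posting lists are made up for the illustration.

```python
def dgaps(posting_list):
    """Turn a list of DocIDs into the first id followed by successive gaps."""
    ids = sorted(posting_list)
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def vbyte_length(n):
    """Bytes needed to store n with 7 payload bits per byte."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def encoded_size(posting_list):
    """Total bytes to store a posting list as variable-byte coded DGaps."""
    return sum(vbyte_length(gap) for gap in dgaps(posting_list))


# The same term occurring in four documents under two DocID assignments.
scattered = [3, 1000, 50000, 900000]   # gaps: 3, 997, 49000, 850000
clustered = [3, 4, 5, 6]               # gaps: 3, 1, 1, 1
scattered_bytes = encoded_size(scattered)   # 1 + 2 + 3 + 3 = 9
clustered_bytes = encoded_size(clustered)   # 1 + 1 + 1 + 1 = 4
```

When documents that share terms receive nearby DocIDs, the gaps shrink and each one fits in fewer bytes, which is the clustering property the slide aims to enhance.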

39 ECD - Industrial Day, Roma 10 Giugno 2004 Compression Enhancement

40 ECD - Industrial Day, Roma 10 Giugno 2004 Assignment Performance

41 ECD - Industrial Day, Roma 10 Giugno 2004 Content delivery (WP 2.1, 2.2 and 2.3)  Web Caching  Mining of web/proxy server requests aimed at improving LRU- based document caching (WP 2.1)  Recommendation system  (On line/Off line) Mining of web sessions aimed at profiling users and recommending them related pages (WP 2.1, 2.3)  Transactional Clustering  Clustering specialized on transactional data aimed at categorizing web pages, user sessions, snippet sequences, search engine results (WP 2.1, 2.2)
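As a baseline for the caching work, a plain LRU document cache can be written in a few lines; mined request patterns would then be used to refine the eviction or prefetching decisions. The class below is a generic LRU sketch and not the project's cache: the `fetch` callback and the hit/miss counters are assumptions added so that the behaviour can be observed.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache for fetched documents."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()   # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, url, fetch):
        """Return the document for url, calling fetch(url) on a miss."""
        if url in self.store:
            self.store.move_to_end(url)      # mark as most recently used
            self.hits += 1
            return self.store[url]
        self.misses += 1
        document = fetch(url)
        self.store[url] = document
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used
        return document


cache = LRUCache(capacity=2)
for requested in ["a", "b", "a", "c", "b"]:
    cache.get(requested, fetch=str.upper)   # str.upper stands in for a download
```

In this trace the request for `c` evicts `b`, so the final request for `b` is a miss even though `b` was seen earlier; knowing from mined sessions that `b` tends to follow `c` is the kind of information that could avoid such evictions.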

42 ECD - Industrial Day, Roma 10 Giugno 2004 Content delivery (WP 2.3)  SUGGEST: a recommendation system made up of two distinct modules  Offline: performing model extraction by a clustering algorithm which partitions the Usage Graph  Online: performing user classification and suggestion generation  The WOS remarkably shortened implementation time (< 500 C++ lines)  We used three WOS objects to produce a persistent clustering structure Citation PageView Session sCluster
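The offline/online split can be illustrated with a deliberately simple stand-in: offline, pages that co-occur in enough sessions are grouped into clusters (connected components of the thresholded co-occurrence graph); online, the current session is matched to the cluster it overlaps most and the unvisited pages of that cluster are suggested. The clustering rule, the threshold, and all names below are assumptions; the slides do not describe SUGGEST's actual algorithm at this level of detail.

```python
from collections import Counter


def build_clusters(sessions, min_weight=2):
    """Offline step: cluster pages that co-occur in at least min_weight sessions."""
    weights = Counter()
    for pages in sessions:
        unique = sorted(set(pages))
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                weights[(a, b)] += 1

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), weight in weights.items():
        if weight >= min_weight:
            parent[find(a)] = find(b)

    clusters = {}
    for page in parent:
        clusters.setdefault(find(page), set()).add(page)
    return list(clusters.values())


def suggest(clusters, current_session):
    """Online step: suggest unvisited pages from the best-matching cluster."""
    visited = set(current_session)
    best = max(clusters, key=lambda c: len(c & visited), default=set())
    if not best & visited:
        return []
    return sorted(best - visited)


history = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["x", "y"], ["x", "y"]]
page_clusters = build_clusters(history)
suggestions = suggest(page_clusters, ["a"])
```

Keeping the expensive clustering offline means the online step only needs a set intersection per cluster, which is what makes per-request suggestions cheap.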

43 ECD - Industrial Day, Roma 10 Giugno 2004 Content delivery (WP 2.2)  Goal: Retrieve the pages which match the user needs. This is a very difficult task because:  the Web size is increasing, and so is the number of answers  the Web coverage is a problem for a single search engine  Web pages are heterogeneous  User needs are subjective and time-varying  the “list of keywords” paradigm for a user query may be ambiguous  SnakeT: clusters the web-snippets returned by many search engines into hierarchically labeled folders which are created on-the-fly to capture the various meanings of the answers returned for a user query

44 ECD - Industrial Day, Roma 10 Giugno 2004 A commercial example: Vivisimo  Mainly a black art: IBM India [WWW 04] and Microsoft China [SIGIR 04] did not make their software publicly available

45 ECD - Industrial Day, Roma 10 Giugno 2004 SnakeT  It offers various interesting features:  Labels are non-contiguous sentences of variable length selected on the basis of two knowledge bases  13 search engines are queried on-the-fly  Hierarchy is built via a greedy strategy which aims for:  Good coverage of the web-snippets  Effective readability of the labels  Parent labels are NOT substrings of descendant labels  Open-source architecture written in C and Perl

46 ECD - Industrial Day, Roma 10 Giugno 2004 SnakeT: An example of use

47 ECD - Industrial Day, Roma 10 Giugno 2004 SnakeT: An example of use  Look at the DEMO

48 ECD - Industrial Day, Roma 10 Giugno 2004 Content delivery (WP 2.1)  Clustering of  s (manco)  XML documents (chiara)  ??

49 ECD - Industrial Day, Roma 10 Giugno 2004 Ongoing and future activities  Work in progress  Pursuing our goal of exploiting USAGE, STRUCTURE, CONTENT Web data to improve efficacy and efficiency in the interaction of the user with the Web  Implementation of additional WOS layers  Compression booster, XML clustering  Future work (medium-long term)  WOS, final version  Community-oriented ranking  Content (news, XML, ...) clustering  Cooperation with Nutch.org (Doug Cutting in Pisa next October)  etc.

50 ECD - Industrial Day, Roma 10 Giugno 2004 Deployment scenarios  Concerning the role of the WOS and of the ECD applications three (non-exclusive) possible deployment scenarios could be devised  The WOS is a research infrastructure, in the spirit of the WebBase project at Stanford University  The WOS is an infrastructure for web analytics services to be offered to third parties, in a spirit close to the WebFountain IBM project  The WOS can become a product for Web Data Management Systems aimed at developing and engineering web mining ECD applications, again in a spirit close to WebBase

51 ECD - Industrial Day, Roma 10 Giugno 2004 Demo Session  Three demos here  WOS: browsing usage data (Mirko Nanni, Vincenzo Bacarella)  SnakeT: Web snippets clustering (Paolo Ferragina, Antonio Gullì)  ANTIX: Participatory Search System (Andrea Esuli)  Some other activities described in the Posters

52 ECD - Industrial Day, Roma 10 Giugno 2004 More information  Interested people can find these slides, more information, documents and the full list of publications at the address: 

