
Issues in Monitoring Web Data
Serge Abiteboul, INRIA and Xyleme
Web Monitoring 2002


Slide 2: Organization
1. Introduction
  – What is there to monitor?
  – Why monitor?
2. Some applications of web monitoring
3. Web archiving
  – An experience: the archiving of the French web
  – Page importance and change frequency
4. Creation of a warehouse using web resources
  – An experience: the Xyleme Project
  – Monitoring in Xyleme
5. Queries and monitoring
6. Conclusion

Slide 3: 1. Introduction

Slide 4: The Web Today
Billions of pages + millions of servers
Query = keywords to retrieve URLs
  – Imprecise; query results are useless for further processing
Applications: based on ad-hoc wrapping
  – Expensive; incomplete; short-lived, not adapted to the Web's constant changes
Poor quality
  – Cannot be trusted: spamming, rumors…
  – Often stale
  – Our vision of it is often out-of-date
Importance of monitoring

Slide 5: The HTML Web Structure
[Figure: structure of the HTML web. Source: IBM, AltaVista, Compaq]

Slide 6: HTML: Percentage covered by Crawlers
[Figure: percentage of the HTML web covered by crawlers. Source:]

Slide 7: So much for the world knowledge…
Most of the web is not reached by crawlers (hidden web)
Some of the public HTML pages are never read
Most of what is on the web is junk anyway
Our knowledge of it may be stale
Do not junk the technology – improve it!

Slide 8: What is there to monitor?
Documents: HTML but also doc, pdf, ps…
Many data exchange formats such as asn1, bibtex…
New official data exchange format: XML
Hidden web: database queries behind forms or scripts
Multimedia data: ignored here
Public vs. private (Intranet or Internet+passwd)
Static vs. dynamic

Slide 9: What is changing?
XML is coming
  – Universal data exchange format
  – Marriage of document and database worlds
  – Standard query language: XQuery
  – Quickly growing on Intranet and very slowly on public web (less than 1%)
Web services are coming
  – Format for exporting services
  – Format for encapsulating queries
More semantics to be expected
  – RDF for data
  – WSDL+UDDI for services

Slide 10: What is not changing fast or even getting worse
Massive quantity of data – most of it junk
Lots of stale data
Very primitive HTML query mechanisms (keywords)
No real change control mechanism soon
  – Compare: database queries (fresh data) to web search engines (possibly stale)
  – Compare: database triggers (based on push) to web notification services (most of the time based on pull/refresh)

Slide 11: The need to monitor the web
The web changes all the time
Users are often as interested in changes as in the data itself – new products, new press articles, new prices…
Discover new resources
Keep our vision of the web up-to-date
Be aware of changes that may be of interest or have an impact on our business

Slide 12: Analogy: databases
Databases
  – Query: instantaneous vision of data
  – Trigger: alert/notification of some changes of interest
Web
  – Query: need monitoring to give correct answer
  – Monitoring: to support alert/notifications of changes of interest

Slide 13: Web vs. database monitoring
Quantity of data: larger on the web
Knowledge of data
  – Structure and semantics known in databases
Reliability and availability
  – High in databases; null on the web
Data granularity
  – Tuple vs. page in HTML or element in XML
Change control
  – Databases: support from data sources/triggers
  – Web: no support; pull only in general

Slide 14: 2. Some applications of web monitoring

Slide 15: Comparative shopping
Unique entry point to many catalogs
Data integration problem
Main issue: wrapping of web catalogs
  – Semi-automatic, so limited to a few sites
  – Simpler and closer to automatic with XML
Alternatives
  – Mediation when data change very fast, e.g., prices and availability of plane tickets
  – Warehousing otherwise: need to monitor changes

Slide 16: Web surveillance
Applications
  – Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products
  – Business intelligence, e.g., discovering potential customers, partners, competitors
Find the data (crawl the web)
Monitor the changes
  – New pages, deleted pages, changes in a page
Classify information and extract data of interest
  – Data mining, text understanding, knowledge representation and extraction, linguistics…
Very AI

Slide 17: Copy tracking
Example: a press agency wants to check that people are not publishing copies of its wires without paying
[Diagram with three numbered steps; labels: slice the document; query to search engine, or specific crawl + pre-filter; flow of candidate documents; filter; detection]
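The "slice the document" and "filter" steps can be illustrated with word shingles. This is a minimal sketch, not the method used in the talk: it assumes a slice is a run of `k` consecutive words and that a candidate is flagged when enough of the wire's slices reappear in it. The function names, `k`, and the threshold are illustrative assumptions.

```python
import re

def shingles(text, k=5):
    """Return the set of k-word slices (shingles) of a text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(wire, candidate, k=5):
    """Fraction of the wire's slices that also occur in the candidate."""
    wire_slices = shingles(wire, k)
    if not wire_slices:
        return 0.0
    return len(wire_slices & shingles(candidate, k)) / len(wire_slices)

def is_probable_copy(wire, candidate, k=5, threshold=0.5):
    """Filter step: keep candidates that reuse at least `threshold` of the wire."""
    return containment(wire, candidate, k) >= threshold
```

In a pipeline like the one on the slide, a few slices would be sent to a search engine to obtain candidate documents, and a check such as `is_probable_copy` would then filter them.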

Slide 18: Web archiving
We will discuss an experience in archiving the French web

Slide 19: Creation of a data warehouse with resources found on the web
We will discuss some work in the Xyleme project on the construction of XML warehouses

Slide 20: 3. Web archiving
An experience towards the archiving of the French web with Bibliothèque Nationale de France

Slide 21: Dépôt légal (legal deposit)
Books have been archived since 1537, following a decision by King François I
The Web is an important and valuable source of information that should also be archived
What is different?
  – Number of content providers: 148000 sites vs. 5000 editors
  – Quantity of information: millions of pages + video/audio
  – Quality of information: lots of junk
  – Relationship with editors: freedom of publication vs. traditional push model
  – Updates and changes occur continuously
  – The perimeter is unclear: what is the French web?

Slide 22: Goal and Scope
Provide future generations with a representative archive of the cultural production
Provide material for cultural, political, sociological studies
The mission is to archive a wide range of material because nobody knows what will be of interest for future research
In traditional publication, publishers filter contents. There is no filter on the web

Slide 23: Similar Projects
The Internet Archive
  – The Wayback Machine
  – Largest collection of versions of web pages
Human selection based approach
  – Select a few hundred sites and choose a periodicity of archiving
  – Australia and Canada
The Nordic experience
  – Use robot crawler to archive a significant part of the surface web
  – Sweden, Finland, Norway
Problems encountered:
  – Lack of updates of archived pages between two snapshots
  – The hidden Web

Slide 24: Orientation of our experiment
Goals:
  – Cover a large portion of the French web: automatic content gathering is necessary
  – Adapt robots to provide a continuous archiving facility: have frequent versions of the sites, at least for the most important ones
Issues:
  – The notion of important sites
  – Building a coherent Web archive
  – Discover and manage important sources of deep Web

Slide 25: First issue: the perimeter
The perimeter of the French Web: contents edited in France
Many criteria may be used:
  – The French language, but many French sites use English (e.g. INRIA) + many French-speaking sites are from other French-speaking countries or regions (e.g. Quebec)
  – Domain name or resource locators; .fr sites, but many are also
  – Address of the site: physical location of the web servers or address of the owner
Other criteria than the perimeter
  – Little interest in commercial sites
  – Possibly interest in foreign sites that discuss French issues
Purely automatic selection does not work: involve librarians

Slide 26: Second issue: Site vs. Page archiving
The Web:
  – Physical granularity = HTML pages
  – The problem is inconsistent data and links: read page P; one week later read pages pointed by P – may not exist anymore
  – Logical granularity? Snapshot view of a web site
  – What is a site? INRIA is +… is the provider of many
  – There are technical issues (rapid firing, …)

Slide 27: Importance of data

Slide 28: What is page importance?
Le Louvre homepage is more important than an unknown person's homepage
Important pages are pointed to by:
  – Other important pages
  – Many unimportant pages
This leads to Google's definition of PageRank
  – Based on the link structure of the web
  – Used with remarkable success by Google for ranking results
Useful but not sufficient for web archiving

Slide 29: Page Importance
Importance
  – Link matrix L
  – In short, page importance is the fixpoint X of the equation L*X = X
  – Storing the link matrix and computing page importance uses lots of resources
We developed a new efficient technique to compute the fixpoint
  – Without having to store the link matrix
  – Technique adapts automatically to changes
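For reference, the fixpoint of L*X = X can be approximated with textbook power iteration over adjacency lists. The sketch below is that standard method, not the new technique mentioned on the slide, whose details are not given here. The damping factor and the even redistribution of importance from pages without out-links are the usual PageRank adjustments and are assumptions added so the iteration converges on any graph.

```python
def page_importance(links, damping=0.85, iterations=100, tol=1e-12):
    """links: dict mapping each page to the list of pages it points to.

    Returns a dict page -> importance; the values sum to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    x = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Importance held by pages without out-links is spread evenly.
        dangling = sum(x[p] for p in pages if not links.get(p))
        nxt = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page splits its importance among the pages it points to.
                nxt[q] += damping * x[p] / len(targets)
        delta = sum(abs(nxt[p] - x[p]) for p in pages)
        x = nxt
        if delta < tol:
            break
    return x
```

Each pass reads every link once, which is exactly the cost the slide points at: for a web-scale graph, both holding the links and iterating over them are expensive.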

Slide 30: Site vs. pages
Limitation of page importance
  – Google page importance works well when links have strong semantics
  – More and more web pages are automatically generated and most links have little semantics
Further limitation
  – Refresh at the page level presents drawbacks
So we also use link topology between sites and not only between pages

Slide 31: Experiments
Crawl
  – We used between 2 and 8 PCs for Xyleme crawlers for 2 months
  – Discovery and refresh based on page importance
Discovery
  – We looked at more than 1.5 billion (most interesting) web pages
  – We discovered more than 15 million *.fr pages (about 1.5% of the web)
  – We discovered 150 000 *.fr sites
Refresh
  – Important pages were refreshed more often
  – Also takes into account the change rate of pages
Analysis of the relevance of site importance for librarians
  – Comparison with ranking by librarians
  – Strong correlation with their rankings

Slide 32: Issues and ongoing work: Other criteria for importance
Take into account indications by archivists
  – They know best; this is a man-machine interface issue
Use classification and clustering techniques to refine the notion of site
Frequent use of infrequent words
  – Find pages dedicated to specific topics
Text weight
  – Find pages with text content (vs. raw data pages)
Others

Slide 33: 4. Creation of a Warehouse from Web data
The Xyleme Project

Slide 34: Xyleme in short
The Xyleme project
  – Initiated at INRIA
  – Joint work with researchers from Orsay, Mannheim and CNAM-Paris universities
The Xyleme company:
  – Started in 2000
  – About 30 people
  – Mission: deliver a new generation of content technologies to unlock the potential of XML
Here: focus on the Xyleme project

Slide 35: Goal of the Xyleme project
Focus is on XML data (but HTML is also handled)
Semantics
  – Understand tags, partition the Web into semantic domains, provide a simple view of each domain
Dynamicity
  – Find and monitor relevant data on the web
  – Control relevant changes in Web data
XML storage, index and queries
  – Efficiently manage millions of XML documents and process millions of simultaneous queries

Slide 36: Corporate information environment with Xyleme
[Diagram labels: Web; Information System; Repository; Query Engine; Xyleme Server; XML Repository; crawling & interpreting data; publishing; systematic updating; queries; searches]

Slide 37: XML in short
Data exchange format
eXtensible Mark-up Language (child of SGML)
Promoted by W3C and major industry players
XML document: ordered labeled tree
Other essential gadgets: Unicode, namespaces, attributes, pointers, typing (XML Schema)…

Slide 38: XML magic in short
Presentation is given elsewhere (style-sheet)
Semantics and structure are provided by labels
So it is easy to extract information
Universal format understood by more and more software (e.g., exported by most databases + read by more and more editors)
More and more tools available

Slide 39: It is easy to extract information
Information System table:
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
XML: camera 359.99 …
Mapping from table columns to XML paths:
Ref    product/reference
Name   product/designation
Price  product/price
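The column-to-path mapping on this slide can be tried out with Python's standard `xml.etree.ElementTree`. This is a minimal sketch: the XML markup of the slide's example was lost in the transcript, so the `<catalog>` root and the exact element layout below are assumptions that merely follow the paths `product/reference`, `product/designation` and `product/price`.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; only the three product paths come from the slide.
CATALOG = """
<catalog>
  <product><reference>X23</reference><designation>Camera</designation><price>359.99</price></product>
  <product><reference>R2D2</reference><designation>Robot</designation><price>19350.00</price></product>
  <product><reference>Z25</reference><designation>PC</designation><price>1299.99</price></product>
</catalog>
"""

# Table column -> child element of <product>.
COLUMNS = {"Ref": "reference", "Name": "designation", "Price": "price"}

def extract_rows(xml_text):
    """Turn every <product> element into one table row (a dict of strings)."""
    root = ET.fromstring(xml_text)
    return [
        {column: product.findtext(tag) for column, tag in COLUMNS.items()}
        for product in root.iter("product")
    ]
```

Because the labels carry the meaning, the extraction needs no page-specific wrapper: the same `COLUMNS` mapping works for any document that uses these tags.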

Slide 40: 4.1 Xyleme: Functionality and architecture

Slide 41: The goal of Xyleme project: XML Dynamic Datawarehouse
Many research issues
  – Query Processor
  – Semantic Classification
  – Data Monitoring
  – Native Storage
  – XML document Versioning
  – XML automatic or user-driven acquisition
  – Graphical User Interface through the Web

Slide 42: Functional Architecture
[Diagram labels: User Interface; Xyleme Interface; Query Processor; Semantic Module; Change Control; Repository and Index Manager; Loader; Acquisition & Crawler; Web Interface; INTERNET]

Slide 43: Architecture
[Diagram labels: Index; Interface; Change | Semantic; Global Query (this group appears twice); ETHERNET; Loader | Query | Version; Repository (this group appears twice); XML DOC extents partitioned by DTD (DTDi, DTDj; DTDk, DTDl; DTDm, ..; DTDp...); Global Loader; Crawler; Web Interface; INTERNET]

Slide 44: Prototype main choices
Network of Linux PCs
C++ on the server side
CORBA for communications between PCs
HTTP + SOAP for external communications
  – Exception for query processing

Slide 45: Scaling
Parallelism based on partitioning
  – XML documents
  – URL table
  – Indexes (semantic partitioning)
Memory replication
Autonomous machines (PCs)
  – Caches are used for data flow

Slide 46: 4.2 Xyleme: Data Acquisition

Slide 47: Data Acquisition
Xyleme crawler visits the HTML/XML web
Management of metadata on pages
Sophisticated strategy to optimize network bandwidth
  – Importance ranking of pages
  – Change frequency and age of pages
  – Publications (owners) & subscriptions (users)
Each crawler visits about 4 million pages per day
Each indexer can build the index for 1 million pages per day
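One way to combine the three signals listed on this slide (importance, change frequency, age) into a refresh order is sketched below. The scoring formula is an assumption for illustration, not Xyleme's actual strategy: it weights importance by the probability that the stored copy is stale, under the common simplification that page changes follow a Poisson process. All names are hypothetical.

```python
import heapq
import math

def refresh_priority(importance, changes_per_day, age_days):
    """Importance weighted by the chance that the stored copy is out of date.

    Under a Poisson change model, a copy that is age_days old is stale
    with probability 1 - exp(-changes_per_day * age_days).
    """
    return importance * (1.0 - math.exp(-changes_per_day * age_days))

def next_to_refresh(pages, budget):
    """pages: iterable of (name, importance, changes_per_day, age_days).

    Returns the `budget` pages most worth re-fetching, best first.
    """
    return heapq.nlargest(
        budget, pages, key=lambda p: refresh_priority(p[1], p[2], p[3])
    )
```

With a fixed daily budget per crawler, a rule of this kind spends bandwidth on pages that are both important and likely to have changed, and leaves important but static pages alone.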

Slide 48: 4.3 Xyleme: Change Control

Slide 49: Change Management
The Web changes all the time
Data acquisition
  – Automatic and via publication
Monitoring
  – Subscriptions
  – Continuous queries
  – Versions

Slide 50: Subscription
Users can subscribe to certain events, e.g.,
  – Changes in all pages of a certain DTD or of a certain semantic domain
  – Insertion of a new product in a particular catalog or in all catalogs with a particular DTD
They may request to be notified
  – At the time the event is detected by Xyleme
  – Regularly, e.g., once a week

Slide 51: Continuous Queries
Queries asked regularly or when some events are detected
  – Send me each Monday the list of movies in Pariscope
  – Send me each Monday the list of new movies in Pariscope
  – Each time you detect that a new member is added to the Stanford DB-group, send me their lists of publications from their homepages
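The difference between the first two Pariscope examples (all movies vs. new movies) comes down to whether the system remembers earlier answers. A minimal sketch, assuming a query is any callable that returns a list of hashable items; the class and parameter names are hypothetical, and scheduling ("each Monday") is left to the caller.

```python
class ContinuousQuery:
    """Re-evaluates a query on demand; optionally reports only unseen items."""

    def __init__(self, query, only_new=False):
        self.query = query          # callable returning the current answer
        self.only_new = only_new    # True: report only items not sent before
        self._seen = set()

    def run(self):
        result = list(self.query())
        if not self.only_new:
            return result
        fresh = [item for item in result if item not in self._seen]
        self._seen.update(result)
        return fresh
```

A scheduler or an event detector would call `run()` at each trigger and send the returned list to the subscriber.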

Slide 52: Versions and Deltas
Store snapshots of documents
For some documents, store changes (deltas)
  – Storage: last version + sequence of deltas
  – Complete delta: allows old versions to be reconstructed
  – Partial delta: allows changes to be sent to the user and allows refresh
  – Deltas are XML documents, so changes can be queried as standard data
Temporal queries
  – List of products that were introduced in this catalog since January 1st 2002
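The "last version + sequence of deltas" layout can be sketched on a simplified document model. Assumptions in this sketch: a document is a flat dict from element identifier to value (Xyleme's deltas are XML documents over trees), values are never `None`, and a complete delta records both the old and the new value of every changed element, which is what makes it reversible.

```python
def diff(old, new):
    """Complete delta: for each changed id, keep (old value, new value)."""
    delta = {}
    for key in old.keys() | new.keys():
        before, after = old.get(key), new.get(key)
        if before != after:
            delta[key] = (before, after)   # None marks "absent"
    return delta

def undo(doc, delta):
    """Apply a delta backwards to obtain the previous version."""
    previous = dict(doc)
    for key, (before, _after) in delta.items():
        if before is None:
            previous.pop(key, None)        # element did not exist before
        else:
            previous[key] = before
    return previous

class VersionedDocument:
    """Stores only the latest version plus the sequence of deltas."""

    def __init__(self, first_version):
        self.latest = dict(first_version)
        self.deltas = []                   # deltas[i]: version i -> version i + 1

    def commit(self, new_version):
        self.deltas.append(diff(self.latest, new_version))
        self.latest = dict(new_version)

    def version(self, index):
        """Rebuild version `index` by undoing deltas from the latest one."""
        doc = self.latest
        for delta in reversed(self.deltas[index:]):
            doc = undo(doc, delta)
        return dict(doc)

    def introduced_since(self, index):
        """Ids that first appear in some delta after version `index`."""
        return sorted({k for d in self.deltas[index:]
                       for k, (before, _a) in d.items() if before is None})
```

`introduced_since` is a toy counterpart of the slide's temporal query: because deltas are ordinary data, "what was introduced since a given version" is answered by reading them, without rebuilding any old version.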

Slide 53: The Information Factory
[Diagram labels: Web; loaders; changes detection; documents and deltas; Repository; version queries; results; continuous queries; time; subscription processor; send notification]

Slide 54: Results
Very efficient XML Diff algorithm
  – Computes the difference between consecutive versions
Representation of deltas based on an original naming scheme for XML elements
  – One element is assigned a unique identifier for its entire life
  – Compact way of representing these IDs
Efficient versioning mechanism

Slide 55: Results
Sophisticated monitoring algorithm
  – Detection of simple patterns (conjunctions) at the document level
  – Detection of changes between consecutive versions of the same documents
Scales to dozens of crawlers loading millions of documents per day for a single monitor
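Detecting conjunctions at the document level can be sketched with an inverted index from atomic events to subscriptions. This is an illustrative counting scheme, not the algorithm behind the reported results; the class name and the event strings are hypothetical.

```python
from collections import defaultdict

class Monitor:
    """Matches documents against subscriptions that are conjunctions of atomic events."""

    def __init__(self):
        self._conditions = {}               # subscription id -> frozenset of atomic events
        self._by_event = defaultdict(set)   # atomic event -> ids of subscriptions using it

    def subscribe(self, sub_id, atomic_events):
        events = frozenset(atomic_events)
        self._conditions[sub_id] = events
        for event in events:
            self._by_event[event].add(sub_id)

    def match(self, detected_events):
        """Return ids of subscriptions whose every atomic event was detected."""
        hits = defaultdict(int)
        for event in set(detected_events):
            for sub_id in self._by_event.get(event, ()):
                hits[sub_id] += 1
        # A conjunction fires when all of its atomic events were counted.
        return sorted(s for s, n in hits.items() if n == len(self._conditions[s]))
```

The work per document depends on the events detected in that document rather than on the total number of subscriptions, which is the property a single monitor needs when it sits behind many crawlers.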

Slide 56: Issues: languages for monitoring
In the spirit of temporal languages for relational databases
But
  – Data model is richer (trees vs. tables)
  – Context is richer: versions, continuous queries, monitoring of data streams…

Slide 57: 4.4 Xyleme: Semantic Data Integration

Slide 58: Data Integration
One application domain, several schemas
  – Heterogeneous vocabulary and structure
Xyleme Semantic Integration
  – Gives the illusion that the system maintains a homogeneous database for this domain
  – Abstracts a set of DTDs into a hierarchy of pertinent terms for a particular domain (business, culture, tourism, biology, …)

Slide 59: Technology in short
Cluster DTDs into application domains
For an application domain, semi-automatically
  – Organize tags into a hierarchy of concepts using thesauri such as WordNet and other linguistic tools
  – This provides the abstract DTD for the particular domain
  – Generate mappings between concrete DTDs and the abstract one

Slide 60: 4.5 Xyleme: Query Processing

Slide 61: Xyleme Query Language
A mix of OQL and XQL; will use the W3C standard when there is one.
Select product/name, product/price
From doc in catalogue, product in doc/product
Where product//components contains flash and product/description contains camera
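To make the meaning of the example query concrete, here is a hand-written evaluation of it in Python over a list of XML documents standing in for the `catalogue` collection. It is a sketch under assumptions: `contains` is approximated by a case-insensitive substring test, whereas the real operator relies on Xyleme's full-text index, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def text_of(elements):
    """Concatenated, lower-cased text of a list of elements."""
    return " ".join("".join(e.itertext()) for e in elements).lower()

def run_query(documents):
    """Evaluate the slide's example query over XML documents given as strings."""
    rows = []
    for xml_text in documents:                        # From doc in catalogue
        doc = ET.fromstring(xml_text)
        for product in doc.findall("product"):        # product in doc/product
            # product//components: components at any depth below product
            components = text_of(product.findall(".//components"))
            # product/description: direct child only
            description = text_of(product.findall("description"))
            if "flash" in components and "camera" in description:   # Where ...
                rows.append((product.findtext("name"),              # Select ...
                             product.findtext("price")))
    return rows
```

The `//` step is what distinguishes the two conditions: `components` may be nested anywhere under the product, while `description` must be a direct child.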

Slide 62: Principle of Querying
A query on the abstract DTD becomes a union of concrete queries (possibly with joins), using MAPPINGS between concrete and abstract DTDs
catalogue/product/price: d1//camera/price, d2/product/cost
catalogue/product/description: d1//camera/description, d2/product/info, ref d2/description
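The translation from an abstract query to a union of concrete ones can be sketched as a lookup in a mapping table. The sketch below uses the paths shown on this slide (omitting the last, partly garbled entry) and makes two simplifying assumptions: the concrete DTD is identified by the first step of each path, and only DTDs that cover every requested abstract path are kept, so the joins mentioned on the slide are not modeled.

```python
# Abstract path -> concrete paths, taken from the slide's example.
MAPPINGS = {
    "catalogue/product/price": ["d1//camera/price", "d2/product/cost"],
    "catalogue/product/description": ["d1//camera/description", "d2/product/info"],
}

def translate(abstract_paths):
    """Return, per concrete DTD, the concrete path answering each abstract path.

    The overall answer is the union of the per-DTD concrete queries.
    """
    wanted = list(dict.fromkeys(abstract_paths))      # drop duplicates, keep order
    per_dtd = {}
    for abstract in wanted:
        for concrete in MAPPINGS.get(abstract, []):
            dtd = concrete.split("/", 1)[0]           # "d1//camera/price" -> "d1"
            per_dtd.setdefault(dtd, {})[abstract] = concrete
    # Keep only DTDs able to answer the whole query on their own (no joins).
    return {dtd: paths for dtd, paths in per_dtd.items()
            if len(paths) == len(wanted)}
```

For the two abstract paths above, the result has one entry for `d1` and one for `d2`, each giving the concrete query to run against documents of that DTD.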

Slide 63: Query Processing
1. Partial translation, from abstract to concrete, to identify machines with relevant data
2. Algebraic rewriting, linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
3. Decomposition into local physical subplans and installation
4. Execution of plans
5. If needed, relaxation

Slide 64: Query processing
Essential use of a smart index combining full-text and structure

Slide 65: 4.6 Xyleme: Repository

Slide 66: Storage System
Xyleme store
  – Efficient storage of trees in variable-length records within fixed-length pages
Balancing of tree branches in case of overflow
  – Minimize the number of I/Os for direct access and scanning
  – Good compromise: compaction / access time

Slide 67: Tree Balancing in Xyleme Store
[Diagram labels: Record 1; Record 2; Record 3; Record 4; More children]

Slide 68: 5. Conclusion

Slide 69: Web monitoring
Very challenging problem
  – Complexity due to the volume of data and the number of users
  – Complexity due to heterogeneity
  – Complexity due to lack of cooperation from data sources
Many issues to investigate

Slide 70: New directions
Active web sites
  – Friendly sites willing to cooperate
  – Web services provide the infrastructure
  – Support for triggers
Mobile data
  – Web sites on mobile devices
  – Issues of availability (device unplugged)
  – Issues in synchronization
  – Geography-dependent queries
