XML Warehousing and Xyleme
S. Abiteboul, INRIA and Xyleme, December 2002

Slide 2: Organization
The context and motivations
XML warehouse
Xyleme: an XML warehouse
Zoom on some aspects of the technology
– Scaling
– Mass storage of XML
– XML query processing
– Semantic integration
– Web page ranking
– Query subscription
Xyleme: the company, in brief

Slide 3: The context
The Web and XML are dramatically changing the world of distributed information

Slide 4: The Web of yesterday
Protocol: HTTP
Documents: HTML
Millions of independent web sites and billions of documents
Browsing and keyword search (full-text indexing)
Publication of databases using forms
Data management with the Web
– HTML is primarily for humans
– Data management applications on the Web are based on hand-made wrappers: expensive, incomplete, short-lived, not adapted to the Web's constant change
No real support for distributed data management!

Slide 5: What is changing
Information used to live in islands and a lot of its value was wasted
1. Different formats: relational, meta data, documents and text, data exchange formats…
– A Web standard for data exchange, XML, is fixing it
– XML can capture all kinds of information over a wide spectrum
– XML comes with a family of emerging standards: XML Schema, XSL/XSLT, XQuery, domain-specific schemas…
2. Different computers, platforms, languages, applications
– Web services, e.g., SOAP, are fixing it
– SOAP allows ubiquitous computing on the Internet
– SOAP comes with a family of emerging standards: WSDL, UDDI

Slide 6: What is changing
XML and Web services provide uniform access to information, independent of platform, system, language, communication protocol and data format…
The dream for distributed data management
The gathering, integration, consolidation and analysis of distributed information become feasible at a much lower cost

Slide 7: (1) XML covers the information spectrum
[Figure: a spectrum running from structured data to minimal structure, with the labels "Meta data" and "Hierarchy +". Examples placed along it: books, contracts, catalogs, bank accounts, emails, financial reports, insurance policies, economic analysis, derivatives, inventory, political analysis, insurance claims, financial news, sports news, resumes.]

Slide 8: XML covers the information spectrum
Very structured information such as databases
– Most DBMS now export in XML
Semi-structured data such as data exchange formats (ASN.1, SGML), e.g., technical documentation
Documents
– Meta-data: author, date, status
– Existing structure in them: chapter, section, table of contents and index
– Possibly tagging of elements in it (citation, lists)
– Links to other documents
Meta data for unstructured data such as images and sound
Plain text

Slide 9: XML's asset: the marriage of text and structure
Labeled ordered trees where leaves are text
Marriage of document and database worlds
Marriage of full-text indexing (keyword search) and structure indexing (SQL-style query)
Is it the ultimate data model? No. Purely syntax; more semantics needed
Is it OK for now? Definitely yes (because it is a standard)

Slide 10: XML's asset: typing
Applications need typing and XML data can be typed if needed (DTD and XML Schema)
Trees
Logical granularity: neither page nor document level, but the piece of information that is needed
Semantics and structure are in tags and paths
– product-table/product/reference
– product-table/product/price
[Figure: a tree rooted at product-table, with nodes product, reference, designation, description and price]

Slide 11: HTML
Information system (source data):
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
Published as HTML: "The new X23 camera replaces the X22. It comes equipped with a flash (worth by itself 53.99 $) and provides great quality for only 359.99 $. The new robot R2D2 …"
Text + presentation. Where is the data? Hard

Slide 12: XML
Information system (source data):
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
...
Published as XML: camera 359.99 … Robot 19350 … ...
Data + structure = semistructured (presentation elsewhere): easy
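
A minimal sketch of why the data is "easy" to find once it is published as XML. The catalog document below is an assumption made for illustration: its element names follow the paths product-table/product/reference and product-table/product/price from the typing slide, and the values come from the table above. It uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog document (not from the slides); element names
# follow the paths product-table/product/reference and .../price.
CATALOG = """
<product-table>
  <product><reference>X23</reference><designation>Camera</designation><price>359.99</price></product>
  <product><reference>R2D2</reference><designation>Robot</designation><price>19350.00</price></product>
  <product><reference>Z25</reference><designation>PC</designation><price>1299.99</price></product>
</product-table>
"""


def prices_by_reference(xml_text):
    """Map each product reference to its price by following tag paths."""
    root = ET.fromstring(xml_text.strip())
    return {
        product.findtext("reference"): float(product.findtext("price"))
        for product in root.findall("product")
    }


prices = prices_by_reference(CATALOG)
print(prices["X23"])
```

With the HTML version, the same extraction would need a hand-made wrapper that guesses which number in the running text is the price.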

Slide 13: (2) Web services and ubiquitous distributed computing
Possibility to activate a method on some remote web server
Exchange information in XML: input and result are in XML
Ubiquitous XML distributed computing infrastructure
2 main applications
– E-commerce
– Access to remote data
With XML and Web services, it is possible
– To get information from virtually anywhere
– To provide information to virtually anywhere

Slide 14: Accessing remote information
Application using gene banks
Query some data services that provide candidate genes
Use some processing services
Heterogeneous formats, protocols, etc.
[Figure: the application connected to gene banks and to processing services]

Slide 15: Same with web services
Application using gene banks
Query some data services that provide candidate genes
Use some processing services
Uniform access to information
[Figure: the application connected through the Web to gene banks and to processing services]

Slide 16: XML and Web services
Exchange of information
– E-commerce, B2B, G2C
– Cooperative work
Information brokers
– Web sites, portals
– Content publication in general
Mediation mode: get the XML pages when needed
Warehouse mode: load them in advance

Slide 17: Advantages of a warehouse approach
Allows for support of complex query processing with high performance
Allows for complex analysis of the data
Allows for enriching the information
Allows for better monitoring of information
Allows for versioning, archiving, temporal queries if needed
Mediator approach is preferable or compulsory in some applications
– Supply chain
– Comparative shopping
– Typically for volatile information such as plane ticket prices

Slide 18: XML warehouse

Slide 19: Main functionalities
[Figure: warehouse architecture. Components: Feeding, Repository, Enrichment, View & Integration, Exploitation. Interfaces: Admin GUI, User GUI, API. Other labels: Editing & Pub, Access, Reporting, Sub, Warehousing (data warehouse), Analysis (OLAP).]

Slide 20: Main functionalities (1) Feeding
Loading from the Web (Internet and Intranet)
– Web search
– Web crawl
– Access Web data via forms or Web services
Plug-ins to load from
– File systems, document management systems
– Databases, LDAP
– Newsgroups, emails
– Other applications
Extraction and transformation
– XSLT or XQuery mappings for XML sources
– XML-izers to load data from other formats
Monitoring of the feeding

Slide 21: Main functionalities (1) Feeding, continued
User feeding
– Document editing
– Meta data editing
– Using WebDAV protocol
Publication
By GUI or from programs (SOAP-based API)

Slide 22: Main functionalities (2) Repository
Storage of massive volumes of XML (terabytes)
Indexing of massive volumes of XML
– By structure
– By full-text
– Linguistic support: stemming, synonyms, etc.
Very efficient XML query processing
Importance ranking
Monitoring of the warehouse (support for subscriptions)
Access control and security
Versioning, archiving
Recovery
No full transaction mechanism

Slide 23: Main functionalities (3) Enrichment
Global organization
– Global schema management
Management of collections
– Incorporate domain ontologies and thesauri
– Document classification
– Cleaning by filtering out documents from collections, etc.
Document enrichment
– Concept extraction and tagging
– Cleaning inside the document
– Summarization, etc.
Relationships between documents
– Tables of contents
– Indexes
– Cross referencing, etc.

Slide 24: Main functionalities (4) View and integration
View management
– Document restructuring/mapping
– Schema to schema mapping
Semantic integration
– Manual for complex ones and (semi-)automatic for simple ones
– Tools to analyze a set of schemas
– Tools to integrate them
– Processing for queries on integration view
Management of virtual data in a mediator style

Slide 25: Functionalities (5) Exploitation
Access to the warehouse
– Browsing
– Querying by keywords, XPath or XQuery
– Temporal queries
Query subscription
Reporting
– Generation of complex reports with pointers to documents, counts, abstracts…
– Organized by collections, content, domains…
By GUI or from programs (Web service-based API)

Slide 26: Admin: specify the lifecycle of information in the warehouse, starting from its acquisition
Specify with parameters (in red): documents to process
Add, from a toolbox, some processing to apply (in pink)
Specify when processing should be applied (in green)
Loading from /u/news/*: start now
Transformation by some XML-izer X: flow
Storage in Collection Z: flow
Classification: off flow
Concept Tagging: off flow
Indexing: flow
Monitoring of Y: flow

Slide 27: Specifying the enrichment
What processing should be performed
– Applications that come with the system
– Arbitrary processing provided as Web services
Interface of services
– XML input: the documents or collection of documents in the warehouse to be processed
– XML output: the result
Where to plug the result
– Where to store the new documents (collections, names)
– Where to put enrichments in existing documents
When to start the processing
– At the time the document is loaded
– At some later time, assuming some information has already been gathered (dependencies)

Slide 28: User: queries and reporting
FROM clause: choose the collections of interest
WHERE clause: choose the criteria of selection
SELECT clause: choose what to extract as a result
PREFER clause: preference ranking and possible relaxation; quantity of results
ORGANIZE clause: classify/group results for presentation and drilling
STYLE clause: choose presentation style

Slide 29: Example
From collections MuséeRodin, WebMuseum, LACMA
Where Art_Item/artist[Name=Rodin]
Select Name, Owner, Annotations
Prefer
1. Rodin in title page
2. Owner is public or owner is in France
– Get first 20
Organize as
1. Art_Item/material: sculpture, painting, others
2. Owner
Present as …

Slide 30: Xyleme
An XML warehouse
Zoom on some aspects of the technology

Slide 31: Xyleme: a dynamic XML warehouse
Scaling
Feeder
– E.g., loading millions of Web documents per day with a single PC, and scaling up with more machines
Repository
– E.g., storing and indexing terabytes of XML (and other formats, e.g., PDF)
Enrichment
– E.g., tools (together with partner) for classification and concept extraction
View and semantic integration
– E.g., a suite of tools for XML integration
Exploitation
– E.g., access via SOAP and graphic interfaces

Slide 32: 1. An architecture to scale

Slide 33: The scaling
Size of data: billions of XML documents
Size of data and index: terabytes
Number of customers
– thousands of simultaneous queries
– millions of subscriptions
An architecture based on distribution

Slide 34: Architecture
Cluster of PCs
Written in C++, runs on Linux (also Solaris)
Communications
– local: CORBA (ORBacus)
– external: HTTP, SOAP
Distribution between autonomous machines

Slide 35: Functional architecture
[Figure: modules shown are Repository and Index Manager, Change Control, Query Processor, Semantic Module, Loader, Acquisition & Crawler, User Interface, Xyleme Interface and Web Interface, connected to the Internet]

Slide 36: Architecture and scaling
[Figure: machines connected by Ethernet and to the Internet. Labels: Index, Repository (three machines), Loader, Query, Change Control and Semantic Integration (two machines), Acquisition and Maintenance (two machines)]

Slide 37: 2. Data Acquisition and Maintenance of Web pages (internet or intranet)

Slide 38: Crawl the Web
Discover HTML/XML pages on the web (intranet or internet)
Parse/load pages and follow links
Manage metadata for the known pages
Do this under bounded resources
– Network bandwidth
– Memory and disk resources
Tested on the Internet in October 2001
– Millions of pages crawled per day on each crawler
– Up to 10 crawlers and close to 1 billion HTML/XML pages discovered in a couple of months

Slide 39: Page Scheduling Optimization
Optimization problem
– Decide which page to crawl or refresh next to optimize the quality of the warehouse
Criteria:
– Read important pages more often (based on customers' preferences; page importance can also be used to order query results)
– Don't read a page that is probably up-to-date (uses an estimate of the change frequency for each page)
Advantages
– Have a fresh view of useful portions of information

Slide 40: Page scheduling
Determine which page to read next
– minimize a particular cost function under some constraint (bandwidth of crawlers)
The penalty for a page takes into account:
– importance of the page (to be defined next)
– customer needs (obtained via pub/sub)
– staleness of the data: penalty for being out of date, penalty for aging
The page scheduler fully controls the crawling
– vs. random crawling in classic search engines
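
The slides do not give the cost function, so the following is only an illustrative sketch of the idea of a penalty-driven scheduler. Everything specific in it is an assumption: the field names, the way importance, subscriptions and staleness are multiplied together, and the use of a Poisson change model (probability of being stale = 1 - exp(-change_rate * age)). The bandwidth constraint is reduced to a fixed number of pages per round.

```python
import math


def penalty(page, now):
    """Hypothetical penalty: importance x customer interest x staleness."""
    age = now - page["last_read"]
    # Assumed Poisson change model: chance the stored copy is out of date.
    p_stale = 1.0 - math.exp(-page["change_rate"] * age)
    return page["importance"] * (1 + page["subscriptions"]) * p_stale


def schedule(pages, now, budget):
    """Greedy choice: read the `budget` pages with the highest penalty."""
    ranked = sorted(pages, key=lambda p: penalty(p, now), reverse=True)
    return [p["url"] for p in ranked[:budget]]


pages = [
    {"url": "a", "importance": 0.5, "subscriptions": 0, "change_rate": 1.0, "last_read": 0.0},
    {"url": "b", "importance": 0.1, "subscriptions": 0, "change_rate": 1.0, "last_read": 0.0},
    {"url": "c", "importance": 0.5, "subscriptions": 0, "change_rate": 1.0, "last_read": 9.9},
]
print(schedule(pages, now=10.0, budget=2))
```

Page "c" is as important as "a" but was read very recently, so it is skipped in favor of the stale pages, which is the behavior the criteria above call for.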

Slide 41: Page Importance
Based on customers' criteria and on the link structure of the web
Intuition: a page is important if many important pages reference it
Fixpoint definition: importance vector Imp
– Proposed by IBM; used by search engines such as Google
– Link matrix: $M(i,j) = 1$ if page $i$ refers to page $j$, and $0$ otherwise
– Outdegree of page $i$: $\mathrm{out}(i)$
– Initialization: $\mathrm{Imp}_0(k) = 1/N$, where $N$ is the number of pages
– Iteration: $\mathrm{Imp}_m(k) = \sum_i M(i,k)\,\mathrm{Imp}_{m-1}(i)/\mathrm{out}(i)$
– Imp is the limit of $\mathrm{Imp}_m$ as $m \to \infty$
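
The fixpoint iteration can be sketched directly from the definition. This is the plain off-line iteration, not the on-line algorithm mentioned on the next slide. The link graph is a made-up example, and the sketch assumes every page has at least one outgoing link and that the graph is such that the iteration converges.

```python
def page_importance(links, iterations=100):
    """Iterate Imp_m(k) = sum_i M(i,k) * Imp_{m-1}(i) / out(i).

    `links` maps each page to the list of pages it refers to.
    Assumption: every page has at least one outgoing link.
    """
    pages = sorted(links)
    imp = {p: 1.0 / len(pages) for p in pages}  # Imp_0(k) = 1/N
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for i in pages:
            share = imp[i] / len(links[i])  # Imp_{m-1}(i) / out(i)
            for k in links[i]:              # pages k with M(i,k) = 1
                nxt[k] += share
        imp = nxt
    return imp


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
importance = page_importance(links)
print(importance)
```

On this graph the values settle near 0.4 for a and c and 0.2 for b: page b is referenced only by a, which splits its importance between two links.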

Slide 42: Page Importance
Novel technology developed by Xyleme
Patent pending
On-line evaluation of page importance
Uses far fewer resources
Faster reaction to changes on the web

Slide 43: 2. XML Repository

Slide 44: Storing XML
Document systems
– Good for keyword search
– No or inefficient support for structure search
Relational store (e.g., Oracle 8i)
– Well adapted for some applications
– Strongly typed data and tables: efficient
– Otherwise: too many joins and inefficient
Object database store (e.g., eXcelon) and native XML databases (e.g., Tamino)
– Same issues
Xyleme: native XML storage

Slide 45: Repository
Goal
– minimize I/O for direct access and scanning
– efficient direct accesses both with full-text indexing and structure indexing
– good compaction but not at the cost of access
Efficient storage of trees
– use fixed-length storage pages
– variable-length records inside a page
Main issue: tree balancing

Slide 46: Tree Balancing
[Figure: a tree stored across Record 1, Record 2 and Record 3]

Slide 47: Tree Balancing
Large collections may use several records

Slide 48: 3. Semantic Data Integration

Slide 49: Classification
Based on word occurrences in document and statistical resources
– Classification by semantic domain
– Classification by language
Use the XXX classifier

Slide 50: Semantic Integration
Web heterogeneity
– Many possible types for data in a particular domain, many DTDs
Semantic integration
– one abstract DTD for the domain
– gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD

Slide 51: Views
Choose an abstract DTD for each domain
– For each concrete DTD in a domain, find how it relates to the abstract DTD using linguistic tools such as WordNet
– Provide relationships between paths in the concrete and abstract DTD
– Possibly automatic, manual or hybrid
With manual mapping, a domain expert may specify much more complex views
Query processing: process queries on the abstract DTD

Slide 52: 4. Query Processing

Slide 53: Query Language
Today: a mix of OQL and XQL
Tomorrow: the future W3C standard
Example:
select product/name, product/price
from doc in catalogue, product in doc/product
where product//components contains "flash" and product/description contains "camera"
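
To make the meaning of the example query concrete, here is a rough Python equivalent over a single small document. It is not Xyleme's query engine: the sample catalogue, the item element and the word-level reading of "contains" are assumptions made for illustration. product//components is read as "any components element below product" and is evaluated with Element.iter.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the names catalogue, product, name, price,
# description and components come from the query on the slide.
CATALOGUE = """<catalogue>
  <product>
    <name>X23</name><price>359.99</price>
    <description>Compact camera</description>
    <components><item>flash</item><item>lens</item></components>
  </product>
  <product>
    <name>R2D2</name><price>19350.00</price>
    <description>Household robot</description>
    <components><item>battery</item></components>
  </product>
</catalogue>"""


def contains(element, word):
    """True if `word` occurs as a word in the text below `element`."""
    if element is None:
        return False
    text = " ".join(element.itertext()).lower()
    return word.lower() in text.split()


def run_query(xml_text):
    """select name, price where //components contains flash and description contains camera."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall("product"):
        has_flash = any(contains(c, "flash") for c in product.iter("components"))
        if has_flash and contains(product.find("description"), "camera"):
            rows.append((product.findtext("name"), product.findtext("price")))
    return rows


result = run_query(CATALOGUE)
print(result)
```

The query mixes a structural condition (paths) with full-text conditions (contains), which is the combination the text-and-structure slide describes.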

Slide 54: Data Distribution
Cluster of documents = physical collection of documents (semantic domain)
Distribution
Storage machine
– in charge of a cluster of documents
Index machine
– index for a cluster

Slide 55: Step 0: Indexing
Standard inverted index
– word → documents that contain this word
Xyleme index
– word → elements that contain this word (document + element identifier)
Goal: more work can be performed without accessing data
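
The difference between the two kinds of index can be sketched in a few lines. This is a toy illustration, not Xyleme's index format: the element identifier used here is simply the preorder rank of the element in its document, and only the text directly inside an element (not the text after its children) is indexed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(docs):
    """Return (document-level index, element-level index) for `docs`.

    `docs` maps a document id to its XML text. Element ids are
    preorder ranks starting at 1 (an assumption for this sketch).
    """
    doc_index = defaultdict(set)
    elem_index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # Element.iter() walks the tree in document (preorder) order.
        for rank, elem in enumerate(root.iter(), start=1):
            for word in (elem.text or "").lower().split():
                doc_index[word].add(doc_id)
                elem_index[word].add((doc_id, rank, elem.tag))
    return doc_index, elem_index


docs = {"d1": "<product><name>camera</name><price>359.99</price></product>"}
doc_index, elem_index = build_indexes(docs)
print(doc_index["camera"], elem_index["camera"])
```

With the element-level entries, a condition such as "name contains camera" can be checked against the index alone, without fetching the document.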

Slide 56: Step 1: Localization
Query on an abstract DTD
Localization of machines that host concrete DTDs that will participate in the query
Global query on abstract DTD → union of local queries on local machines
Example: catalogue/product/price is relevant for machine 56 and machine 45

Slide 57: Step 2: Optimization
Algebraic rewriting
Linear search strategy based on simple heuristics
– use in-memory indexes
– minimize communication
Optimization of the global plan
Optimization of the local plans

Slide 58: Step 3: Execution
A plan usually consists of:
1. parallel translation from abstract queries to concrete patterns on the relevant index machines
2. parallel index scans to identify the relevant elements for a concrete pattern
3. parallel construction of resulting elements
4. pipeline evaluation (i.e., no intermediate data structure)
Note: step 2 requires smart indexes

Slide 59: Abstract2Concrete
For each concrete pattern, the local plan is optimized dynamically
For catalogue/product/price, scan the relevant concrete patterns: d1//camera/price, d2/product/cost, d3/piano/price, ...
For each concrete pattern, scan the element ids: &234, &177, ...
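
The localization and abstract-to-concrete steps can be sketched as two table lookups. The abstract path and the three concrete patterns come from the slide; the assignment of concrete DTDs d1, d2, d3 to machines 56 and 45 is a made-up example, as are the function and table names.

```python
# Abstract path -> concrete patterns (values taken from the slide).
ABSTRACT_TO_CONCRETE = {
    "catalogue/product/price": [
        "d1//camera/price",
        "d2/product/cost",
        "d3/piano/price",
    ],
}

# Hypothetical placement of concrete DTDs on index machines.
MACHINE_OF_DTD = {"d1": 56, "d2": 45, "d3": 45}


def localize(abstract_path):
    """Group the concrete patterns for `abstract_path` by hosting machine."""
    plan = {}
    for pattern in ABSTRACT_TO_CONCRETE.get(abstract_path, []):
        dtd = pattern.split("/", 1)[0]  # leading component names the DTD
        plan.setdefault(MACHINE_OF_DTD[dtd], []).append(pattern)
    return plan


local_queries = localize("catalogue/product/price")
print(local_queries)
```

Each machine then runs only its own list of patterns, and the global answer is the union of the local results.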

Slide 60: Identifiers
Essential for query processing
Identifier = (preorder rank, postorder rank)
– X ancestor of Y ⟺ pre(X) < pre(Y) and post(X) > post(Y)
– E.g., 2 < 5 and 4 > 2 ⟹ (2,4) is an ancestor of (5,2)
[Figure: a tree with nodes A to G and text leaves, each node labeled with its preorder rank and postorder rank, both from 1 to 7]
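
A small sketch of the identifier scheme: number a tree in preorder and postorder, then decide ancestry by comparing identifiers only. The shape of the tree is an assumption (the figure's exact shape is not recoverable from the transcript); it was chosen so that it contains the identifiers (2,4) and (5,2) from the example.

```python
def number_tree(children, root):
    """Return {node: (preorder rank, postorder rank)}, ranks starting at 1."""
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        pre[node] = counters["pre"]      # rank when the node is entered
        for child in children.get(node, []):
            visit(child)
        counters["post"] += 1
        post[node] = counters["post"]    # rank when the node is left

    visit(root)
    return {node: (pre[node], post[node]) for node in pre}


def is_ancestor(x, y):
    """X is an ancestor of Y iff pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]


# Hypothetical tree: A has children B and F; B has C and D; D has E; F has G.
children = {"A": ["B", "F"], "B": ["C", "D"], "D": ["E"], "F": ["G"]}
ids = number_tree(children, "A")
print(ids["B"], ids["E"], is_ancestor(ids["B"], ids["E"]))
```

The test needs only the two identifiers, which is why the index can answer structural conditions without touching the stored documents.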

Slide 61: 5. Change Control

Slide 62: Change management
Users are often interested in changes to the web
Change monitoring
– query subscription
Soon to come: version management
– representation and storage of changes

Slide 63: Query Subscription
Users subscribe to certain events such as
– Update of a particular page, or of a page in a given site
– Discovery of a new page containing some specific words
– Insertion of a particular element in some pages (new products in a catalog)
– Detection of illegal copies of selected documents
Users may request to be notified
– Immediately at the time the event is detected
– Regularly, e.g., weekly
– After a certain number of event detections

Slide 64: Examples
subscription myPariscope
% what are the new movie entries in Pariscope site
monitoring newMovies
select URL
where URL extends* and new(self)
% manage the changes in the movies showing in Paris
continuous delta Showing
select ... from ... where
when daily
notify daily % send me a daily report

Slide 65: Atomic Events
Loading of millions of pages/day
Atomic event 46: URL matches pattern*
Atomic event 67: XML document contains the tag soccer
[Figure: a document d being loaded passes through the HTML parser, XML loader and metadata manager; the atomic events detected along the way (d/46, then d/46,67) are sent to complex event detection]

Slide 66: Complex Events
Complex event 12: 67 & 46 (XML document contains the tag soccer and URL matches pattern*)
Several millions of pages crawled per day
Hundreds of millions of alerts raised
Millions of subscriptions
[Figure: HTML parser and XML loader feeding complex event detection]
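
The relationship between atomic and complex events can be sketched as a subset test: a complex event fires for a document when all of its atomic events were raised for that document. Complex event 12 = 67 & 46 is from the slide; complex event 13 and all names are made up for illustration. This linear scan only shows the semantics; it is not the scalable algorithm the slides refer to and would not cope with millions of subscriptions.

```python
# Complex event id -> atomic events that must all be raised.
COMPLEX_EVENTS = {
    12: frozenset({46, 67}),  # from the slide: 67 & 46
    13: frozenset({46}),      # hypothetical second subscription
}


def detect_complex_events(atomic_events, complex_events=COMPLEX_EVENTS):
    """Return the ids of complex events satisfied by one document's atomic events."""
    raised = set(atomic_events)
    return sorted(cid for cid, needed in complex_events.items() if needed <= raised)


print(detect_complex_events([46, 67]))
```

A document that raises only atomic event 46 (URL matches the pattern) satisfies the hypothetical event 13 but not event 12, which also needs the soccer tag.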

Slide 67: Notification Processing
Very efficient/scalable algorithm for complex event detection
Notifications by
– Email
– Web posting
– Web services in SOAP
Millions of notifications/day
[Figure: complex event detection sends alerts to the notification processor, which sends notifications]

Slide 68: Xyleme in short
Spin-off of INRIA (National Research Institute)
– Technology developed in a research project of 60 man-years
Creation of Xyleme SA in September 2000
Now about 25 people: 13 R&D, 4 services, 10 marketing, sales & admin.
Customers include: press agency (AFP), newspaper groups (Moniteur, Le Monde), national library (BNF)
First round of capital in 2000 (SGAM & Viventures); second round in 2002 (Deutsche Bank)

Slide 69: Thank you
(*) If you want to know more about Xyleme
