EDA06 - Entrepôts de contenu (Content warehouses)
Content warehouses around XML and Web services
Serge Abiteboul, INRIA-Futurs and LRI-Paris 11

Introduction

Joint works: some participants and projects
- Xyleme: Sophie Cluet, Guy Ferran & many others
- Acware within Edot project: Benjamin Nguyen, Gabriela Ruberg, Gregory Cobena
- Active XML within DbGlobe project: Omar Benjelloun, Ioana Manolescu, Tova Milo & many others
- KadoP with Edos project: Ioana Manolescu, Nicoleta Preda & many others

Success stories in the time of the Internet bubble: information management
- Google: management of Web pages
- Mapquest: management of maps
- Amazon: book catalogue
- eBay: product catalogue
- Napster (eMule, BearShare, etc.): music database
- Flickr: picture database
- Wikipedia: encyclopedia
- Even in France
  - Meetic: dating database
  - Kelkoo: comparative shopping

The trend is towards peer-to-peer infoware
- Why? The Web is switching from centralized servers to communities and syndication
  - Buzzwords such as Web 2.0 (?)
- Infoware: a class of software whose goal is no longer to process information but to manage it globally, because the volumes involved keep growing
- Analogy: software development by very structured and controlled groups of programmers vs. open-source software produced by large communities of autonomous developers

Outline
- Introduction
- Content warehouse
  - Concept
  - XML and Web services
  - Xyleme
- Peer-to-peer content warehouse
  - Concept
  - Active XML
  - KadoP
- Conclusion

Content warehouse

Warehouse
- Goal: integrated access to heterogeneous, autonomous, distributed sources of information
- Main functionalities: acquire, transform, filter, clean and integrate data; support for queries
- Warehouse vs. mediation
  - Warehouse: information is acquired in advance
  - Mediation: information is acquired when needed
  - Classical tradeoff between updates and queries
  - Typically a mix of both
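
The warehouse vs. mediation tradeoff can be illustrated with a minimal sketch. The class names `Warehouse` and `Mediator`, the `sources` dictionary and the sample price data are all hypothetical, introduced only for illustration; a real system would load from remote sources rather than call local functions.

```python
from typing import Callable, Dict

# Assumption: a source is modeled as a function returning its current content.
Source = Callable[[], str]


class Warehouse:
    """Acquires information in advance; queries read the local copy."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self.sources = sources
        self.store: Dict[str, str] = {}

    def refresh(self) -> None:
        # Acquisition step, run ahead of any query.
        self.store = {name: fetch() for name, fetch in self.sources.items()}

    def query(self, name: str) -> str:
        # Cheap at query time, but possibly stale.
        return self.store[name]


class Mediator:
    """Acquires information only when a query needs it."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self.sources = sources

    def query(self, name: str) -> str:
        # Always fresh, but every query pays the cost of reaching the source.
        return self.sources[name]()


prices = {"book": "10 EUR"}
sources = {"catalog": lambda: prices["book"]}

warehouse = Warehouse(sources)
warehouse.refresh()
mediator = Mediator(sources)

prices["book"] = "12 EUR"               # the source is updated afterwards
stale = warehouse.query("catalog")      # still the value acquired in advance
fresh = mediator.query("catalog")       # value acquired when needed
```

The warehouse answer only catches up after the next `refresh()`, which is the update/query tradeoff named on the slide.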

Content warehouse
- All kinds of content
  - Mail, reports, news, Web pages, contacts, catalogs, annotations, etc.
  - Text, multimedia, etc.
- Little is numerical, unlike OLAP
  - Some may be mixed, e.g., financial reports
- Typically found on the Web and not in relational databases

Content vs. data warehouse
                        Data warehouse                       XML warehouse
Data                    relational data, numerical values    XML, text
Enrichment              cleaning                             cleaning, classification, semantics…
Integration and view    relations, cube                      XML
Query                   SQL                                  XQuery, XSLT
Exploitation            OLAP; statistical tools;             browsing; report generation
                        report generation

XML warehouse
[Figure: operational data sources feed the XML warehouse; applications exploit it]
- Import data from many sources
- Add value to it without interfering with operational data
- Export integrated views of it
- Same as a relational warehouse

The basis of content management
- Standard for data exchange: XML, XML Schema…
  - Extensible Markup Language
  - Labeled ordered trees
  - Foundations: tree automata
- Query languages: XPath, XQuery…
  - Foundations: tree automata
  - Not perfect, but at least they exist
[Figure: XML, XPath, XQuery, SOAP, WSDL]

Functionalities
[Figure: warehouse functionalities, from feeding to exploiting]
- Feeding
- Store & index; query processing
- View and semantics: stemming, integration, classification…
- Exploiting: Web GUI, Web services, reporting…

Functionalities: Feeding
- Loading from the Web (Internet and intranet)
  - Web search
  - Web crawl
  - Access Web data via forms or Web services
- Plug-ins to load from
  - File systems, document management systems
  - Databases, LDAP
  - Newsgroups, emails
  - Other applications
- Extraction and transformation
  - XSLT or XQuery mappings for XML sources
  - XML-izers to load data from other formats
- Monitoring of the feeding
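
As an illustration of an "XML-izer", here is a minimal sketch that turns CSV text into an XML tree using only the Python standard library. The function name `xmlize_csv`, the tag names and the sample rows are hypothetical; the sketch also assumes that every CSV column name is a valid XML element name.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xmlize_csv(text: str, root_tag: str = "records", row_tag: str = "record") -> ET.Element:
    """Build one element per CSV row, with one child element per column."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(text)):
        record = ET.SubElement(root, row_tag)
        for field, value in row.items():
            # Assumption: column names are valid XML element names.
            ET.SubElement(record, field).text = value
    return root


sample = "name,city\nAlice,Paris\nBob,Rennes\n"
document = xmlize_csv(sample)
xml_text = ET.tostring(document, encoding="unicode")
```

Once the data is in XML, the same XSLT or XQuery mappings used for native XML sources can be applied to it.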

Functionalities: More feeding
- User feeding
  - Document editing
  - Metadata editing
- Publication API: SOAP and WebDAV

Functionalities: Storage
- Storage of massive volumes of XML (terabytes)
- Indexing of massive volumes of XML
  - By structure
  - By full text
  - Linguistic support: multiple languages, stemming, synonyms, etc.
- Very efficient XML query processing
- Importance ranking
- Monitoring of the warehouse (support for subscriptions)
- Access control and security
- Versioning, archiving
- Recovery
- Possibly a transaction mechanism

Functionalities: Enrichment
- Global organization
  - Global schema management
  - Management of collections
  - Incorporate domain ontologies and thesauri
  - Document classification
  - Cleaning by filtering out documents from collections, etc.
- Document enrichment
  - Concept extraction and tagging
  - Cleaning inside the document
  - Summarization, etc.
- Relationships between documents
  - Tables of contents
  - Indexes
  - Cross-referencing, etc.

Functionalities: View & integration
- View management
  - Document restructuring/mapping
  - Schema-to-schema mapping
- Semantic integration
  - Manual for complex cases and (semi-)automatic for simple ones
  - Tools to analyze a set of schemas
  - Tools to integrate them
- Processing of queries on the integrated view
- Management of virtual data in a mediator style

Functionalities: Exploitation
- Access to the warehouse
  - Browsing
  - Querying by keywords, XPath or XQuery
  - Temporal queries
  - Query subscription
- Reporting
  - Generation of complex reports with pointers to documents, counts, abstracts…
  - Organized by collections, content, domains…
- By GUI or from programs (Web service-based API)
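
Querying by keywords and by XPath can be sketched with Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample collection and the helper names `by_keyword` and `by_xpath` are hypothetical and say nothing about how Xyleme exposes these queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-document collection standing in for the warehouse.
warehouse = ET.fromstring(
    "<collection>"
    "<article lang='fr'><title>Entrepots</title><body>XML et services Web</body></article>"
    "<article lang='en'><title>Warehouses</title><body>XML and Web services</body></article>"
    "</collection>"
)


def by_keyword(root: ET.Element, keyword: str) -> list:
    """Titles of the articles whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        article.findtext("title")
        for article in root.iter("article")
        if needle in " ".join(article.itertext()).lower()
    ]


def by_xpath(root: ET.Element, path: str) -> list:
    """Text of the elements selected by a path in ElementTree's XPath subset."""
    return [element.text for element in root.findall(path)]


english_titles = by_xpath(warehouse, "./article[@lang='en']/title")
about_services = by_keyword(warehouse, "services")
```

A full XQuery engine would add joins, ordering and result construction on top of this kind of path selection.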

Xyleme content warehouse

Xyleme in short
- 1999: Xyleme research project at INRIA
- 2000: Creation of a spin-off
- 2006: About 40 people
- Technology: a content warehouse built around a very efficient and scalable XML repository
- Application example: all articles of Le Monde in XML

Xyleme functionalities
[Figure: warehouse functionalities, from feeding to exploiting]
- Feeding
- Store & index; query processing
- View and semantics: stemming, integration, classification…
- Exploiting: Web GUI, Web services, reporting…

Xyleme architecture
[Figure: Xyleme architecture, client side and server side]
- Client side: applications (IE, Java, C++, .NET...) on any platform, connected through HTTP, the Web service API, or the Java/C++ API
- Server side
  - Application server (Tomcat, SOAP)
  - Global query manager
  - CORBA: name server, user manager, URL manager, notification manager
  - Several nodes, each with an XML store, index, loader and local query processor

Structural identifiers and indexing
- Structural IDs = prefix-postfix-level
- X ancestor of Y iff pre(X) < pre(Y) and post(X) > post(Y)
- X parent of Y iff X ancestor of Y and level(X) = level(Y) - 1
[Figure: example tree with nodes A, B, C, D, E, F, G and John; on a LAN, Put(C;[d,p,6,6,1]) and Put(John;[d,p,3,1,2]) are placed by hash(C) and hash(John)]
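
The two tests above can be checked on a small tree. This is a minimal sketch: the tree shape, the assumption that node names are unique, and the names `StructId`, `label`, `is_ancestor` and `is_parent` are illustrative choices, not Xyleme's actual encoding.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StructId:
    pre: int    # rank in a pre-order traversal
    post: int   # rank in a post-order traversal
    level: int  # depth, with the root at level 1


def label(tree) -> Dict[str, StructId]:
    """Assign (pre, post, level) to every node of a (name, children) tree."""
    ids: Dict[str, StructId] = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, level: int) -> None:
        name, children = node
        counters["pre"] += 1
        pre = counters["pre"]
        for child in children:
            visit(child, level + 1)
        counters["post"] += 1
        ids[name] = StructId(pre, counters["post"], level)

    visit(tree, 1)
    return ids


def is_ancestor(x: StructId, y: StructId) -> bool:
    return x.pre < y.pre and x.post > y.post


def is_parent(x: StructId, y: StructId) -> bool:
    return is_ancestor(x, y) and x.level == y.level - 1


# Hypothetical tree: A has children B and C; B has D and E; C has F.
tree = ("A", [("B", [("D", []), ("E", [])]), ("C", [("F", [])])])
ids = label(tree)
```

Both tests use only the identifiers of the two nodes, so they can be evaluated on index entries without opening the documents.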

Query evaluation based on holistic twig joins
[Figure: tree pattern over A, D, C and John, matched against postings such as (d1, 201, 400), (d1, 224, 201), (d1, 228, 237)]
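
To show what such a join computes, here is a naive sketch for a linear pattern such as A//C//John over posting lists. It is deliberately not the holistic twig join algorithm, which processes all lists in one coordinated pass and handles branching patterns; this version simply extends partial matches one step at a time. The posting layout (document, pre, post) and the sample values are assumptions.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, int, int]  # assumed layout: (document id, pre, post)


def contains(ancestor: Posting, descendant: Posting) -> bool:
    """Same document, and the pre/post test for the ancestor relationship."""
    return (
        ancestor[0] == descendant[0]
        and ancestor[1] < descendant[1]
        and ancestor[2] > descendant[2]
    )


def match_path(lists: List[List[Posting]]) -> List[Tuple[Posting, ...]]:
    """Match t1//t2//...//tn given one posting list per term, in pattern order."""
    partial = [(posting,) for posting in lists[0]]
    for postings in lists[1:]:
        partial = [
            match + (posting,)
            for match in partial
            for posting in postings
            if contains(match[-1], posting)
        ]
    return partial


# Hypothetical index: term -> postings.
index: Dict[str, List[Posting]] = {
    "A": [("d1", 1, 6), ("d2", 1, 3)],
    "C": [("d1", 5, 5), ("d2", 2, 1)],
    "John": [("d1", 6, 4)],
}
matches = match_path([index["A"], index["C"], index["John"]])
```

The naive version materializes every partial match; avoiding useless intermediate results is precisely what the holistic approach is designed for.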

Peer-to-peer content warehouse

The golden triangle of distributed content management on the Web
- Standard for data exchange: XML, XML Schema…
  - Extensible Markup Language
  - Labeled ordered trees
  - Foundations: tree automata
- Query languages: XPath, XQuery…
- Standards for distributed computing: Web services (SOAP, WSDL, UDDI…)
  - Simple Object Access Protocol
  - Like CORBA, but simpler and on the Web
[Figure: triangle of XML, XPath/XQuery and SOAP/WSDL]

Peer-to-peer
- A large and varying number of computers cooperate to solve some particular task without any centralized authority
- Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
- Examples
  - SETI@home: search for extraterrestrial intelligence
  - Kazaa: obtain free music/video over the net
  - Cabal: decryption of 512-bit RSA code
  - Grub: P2P Web search

An XML warehouse in P2P
- Warehouse: a very centralized system
- P2P: an ultra-distributed system (no authority)
- P2P warehouse: an oxymoron? No!
  - A warehouse from a logical viewpoint
  - A P2P system from a physical viewpoint

[Figure: four architectures]
- Centralized mediation: data sources accessed through a mediator
- P2P mediation: data sources accessed through a P2P mediator
- Centralized warehouse: data sources loaded into a warehouse (logical & physical)
- P2P warehouse: data sources loaded into a warehouse that is one warehouse logically and a P2P system physically

P2P XML warehouse
- Data sources and peers are distributed, transient and autonomous
- Information is distributed and replicated
- Nothing is centralized: not the control, storage, indexing…
- The machines cooperate with some level of trust to provide the functionalities of an XML warehouse

Advantages
- Performance
  - Optimization of parallelism
  - Avoid bottlenecks
  - Replication
- Availability
  - Replication
- Cost
  - Avoid the cost of a server
  - Share operational cost
- Dynamicity
  - Add/remove new data sources
  - Better scaling
Disadvantages
- Performance
  - Cost for complex queries
  - Communication cost
- Availability
  - Peers can leave
  - Consistency maintenance
  - Difficult to support transactions
- Quality
  - Difficult to guarantee quality

Centralized vs. distributed data management
Relational DBMS                  P2P warehouse
Relations                        Active XML
Schema and constraints           Ontologies (including XML Schema)
B-tree, hashing, full text       AXML indexes & global indexes
Disk pages                       AXML persistent store
SQL (query & update)             Query & update (XQuery, WebDAV)
ACL                              Network access control
Historical DB                    Provenance and history
Triggers                         Monitoring
(from distributed DBMS)          Replication and partitioning
Approximation, incompleteness    Same, but even more important
                                 Discovery of data/services
                                 Multicasting

Two classes of P2P networks
- Unstructured P2P networks
  - Local exchange: mappings relate content on different peers
  - Queries are propagated (flooding)
  - SomeWhere,...
- Structured P2P networks
  - Content is indexed globally and located via the index
  - Local content, global access
  - KadoP,...

ActiveXML: A framework for distributed data management

Active XML
- The standards of distributed data management
- Active XML = XML documents with embedded Web service calls, where service calls are typically in XQuery
- Intensional & dynamic
- This is not a new idea
  - Procedural attributes in relational systems
  - Basis of object databases
  - Sun's JSP, PHP+MySQL, Apache Jelly…
[Figure: XML, XPath, XQuery, SOAP, WSDL]

Active XML = XML + embedded service calls (omitting syntactic details)
[Example: XML document (Aspen …) with an embedded service call]
- May contain calls
  - to any SOAP Web service
  - to any AXML Web service (to be defined)
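
The idea of a document with embedded calls can be sketched as follows. Everything in this example is hypothetical: the `<call service="...">` element is not the actual Active XML syntax, the `snow_depth` service is a local Python function standing in for a SOAP Web service, and its returned value is a stub.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Assumption: a service takes the text of the call element and returns XML elements.
Service = Callable[[str], List[ET.Element]]


def materialize(root: ET.Element, services: Dict[str, Service]) -> None:
    """Replace each <call> element, in place, by the elements its service returns.

    Only one round is performed: calls contained in a result stay intensional.
    """
    for parent in list(root.iter()):
        # Walk children from last to first so earlier indexes stay valid.
        for position, child in reversed(list(enumerate(parent))):
            if child.tag != "call":
                continue
            results = services[child.get("service")](child.text or "")
            parent.remove(child)
            for offset, result in enumerate(results):
                parent.insert(position + offset, result)


def snow_depth(place: str) -> List[ET.Element]:
    """Stub service; a real peer would invoke a remote Web service here."""
    depth = ET.Element("depth")
    depth.text = "depth-of-" + place
    return [depth]


document = ET.fromstring(
    "<resorts><resort><name>Aspen</name>"
    "<call service='snow_depth'>Aspen</call></resort></resorts>"
)
materialize(document, {"snow_depth": snow_depth})
materialized = ET.tostring(document, encoding="unicode")
```

Before `materialize` runs, the document is intensional (it says how to obtain the data); afterwards it is extensional (it holds the data).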

Not a new idea in databases, not a new idea on the Web
- Mixing calls and data is an old idea
  - Procedural attributes in relational systems
  - Basis of object databases
- In the HTML world
  - Sun's JSP, PHP+MySQL
- Calls to Web services inside documents
  - Macromedia MX, Apache Jelly

What exactly to exchange
- A parameter of a call contains some service calls
- The result of a call contains some service calls
- Do we evaluate these calls before transmitting the data or not?
- Example: "Hi John, what is the phone number of the CEO of INRIA?"
  - (33 1) …
  - Look in the INRIA directory at Michel Cosnard
  - Find his name at … then look in the directory

When to activate the call
- Explicit pull mode
  - Frequency: daily, weekly, etc.
  - After some event, e.g., when another service call has completed
  - This aspect of the problem is related to active databases
- Implicit pull mode: lazy
  - When the data is requested
  - Difficulty: detecting that the result of a particular request may be affected by a particular call
  - This is related to deductive databases
- Push mode
  - E.g., based on a query subscription; the Web server pushes information to the client
  - E.g., synchronization with an external source
  - This is related to stream and subscription queries
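
The three activation modes can be contrasted on a single embedded call. This is a minimal sketch with hypothetical names (`EmbeddedCall`, `refresh`, `read`, `on_push`); it ignores the hard part mentioned above, namely detecting which calls are relevant to a given request.

```python
from typing import Callable, Optional


class EmbeddedCall:
    """One embedded service call and its (possibly missing) result."""

    def __init__(self, service: Callable[[], str]) -> None:
        self.service = service
        self.result: Optional[str] = None
        self.invocations = 0

    def refresh(self) -> None:
        """Explicit pull: a scheduler or an event handler triggers the call."""
        self.result = self.service()
        self.invocations += 1

    def read(self) -> str:
        """Implicit (lazy) pull: call the service only when the data is requested."""
        if self.result is None:
            self.refresh()
        return self.result

    def on_push(self, value: str) -> None:
        """Push: the provider sends a new value, e.g. for a query subscription."""
        self.result = value


call = EmbeddedCall(lambda: "fresh value")
before = call.invocations      # nothing has been called yet
first = call.read()            # triggers the service once
second = call.read()           # served from the stored result
after = call.invocations
call.on_push("pushed value")
pushed = call.read()
```

In explicit pull mode, `refresh()` would be invoked by a timer (daily, weekly) or by the completion of another call rather than by a reader.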

Active XML peer
- Peer-to-peer architecture
- Each Active XML peer
  - Repository: manages Active XML data with embedded Web service calls
  - Web client: uses Web services
  - Web server: provides (parameterized) queries/updates over the repository as Web services
- Open-source system
  - Sun's Java SDK 1.4
  - XML parser
  - XPath processor, XSLT engine
  - Apache Tomcat 4.0 servlet engine
  - Apache Axis SOAP toolkit 1.0
  - X-OQL query processor
  - Persistent DOM repository
  - JSP-based user interface
  - JSTL 1.0 standard tag library
[Figure: AXML peers communicating via SOAP]

KadoP: a P2P system for sharing content

KadoP model
- Data: XML documents; views; Active XML; Web services
- Simple semantics: concepts, namespaces, DTDs, isA, partOf, relatedTo, context documents (for services)
- Queries: tree pattern queries with joins
- KadoP
  - XML data distributed in the P2P network
  - Index is distributed via a DHT
- Goal: efficient processing of terabytes of XML with no centralized authority

Distributed hash tables
- Typically on a WAN
- Peers come and go
- Small number of messages to locate the peer in charge of key k: log n
- Standard interface: put, get
- We tried Pastry, Chord and JXTA
- We now use Pastry
[Figure: DHT where put(k;v1), put(k;v2) and put(k;v3) are routed by hash(k), and get(k) returns v1, v2, v3]
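
The put/get interface can be sketched with a toy, in-process hash ring. This is not Pastry or Chord: there is no network, no routing in log n messages and no handling of peers joining or leaving. The class `ToyDHT`, the helper `ring_position` and the peer names are hypothetical.

```python
import bisect
import hashlib
from typing import Dict, List


def ring_position(text: str) -> int:
    """Map a peer name or a key to a position on the ring."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyDHT:
    def __init__(self, peer_names: List[str]) -> None:
        self.ring = sorted((ring_position(name), name) for name in peer_names)
        self.storage: Dict[str, Dict[str, List[str]]] = {name: {} for name in peer_names}

    def peer_for(self, key: str) -> str:
        """The peer in charge of a key: first peer at or after hash(key), wrapping around."""
        points = [point for point, _ in self.ring]
        position = bisect.bisect_left(points, ring_position(key)) % len(self.ring)
        return self.ring[position][1]

    def put(self, key: str, value: str) -> None:
        # Values published under the same key accumulate, as in the figure.
        self.storage[self.peer_for(key)].setdefault(key, []).append(value)

    def get(self, key: str) -> List[str]:
        return list(self.storage[self.peer_for(key)].get(key, []))


dht = ToyDHT(["peer1", "peer2", "peer3", "peer4"])
for value in ("v1", "v2", "v3"):
    dht.put("k", value)
values = dht.get("k")
```

Because every participant computes the same `peer_for(key)`, publishers and readers agree on where a key lives without any central directory.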

Indexing in KadoP
- Use structural IDs as in Xyleme
- Publish them in a DHT
- Use holistic twig joins
- Main issue: communications
  - WAN vs. LAN
  - Long posting lists
- Optimization techniques
  - Use only docID [wisconsin]
  - Ship smallest list
  - Semi-join techniques
  - Intensional indexing
[Figure: DHT where put(C;[d,p,6,6,1]) and put(John;[d,p,3,1,2]) are routed by hash(C) and hash(John)]
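
Two of the optimizations above, using only document identifiers and starting from the smallest list, can be sketched as a pre-filtering step. This is an illustration under stated assumptions, not KadoP's implementation: the index is a plain dictionary standing in for the DHT, and the posting layout (peer, document, pre, post, level) and the sample values are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Posting = Tuple[str, str, int, int, int]  # assumed: (peer, document, pre, post, level)
DocId = Tuple[str, str]                   # (peer, document)


def docs_of(postings: List[Posting]) -> Set[DocId]:
    """Keep only document identifiers, which are much smaller than full postings."""
    return {(peer, doc) for peer, doc, *_ in postings}


def candidate_docs(index: Dict[str, List[Posting]], terms: List[str]) -> Set[DocId]:
    """Documents containing every term; expects at least one term."""
    lists = sorted((index.get(term, []) for term in terms), key=len)
    # Start from the smallest list: it is the cheapest one to ship.
    candidates = docs_of(lists[0])
    for postings in lists[1:]:
        candidates &= docs_of(postings)
    return candidates


index: Dict[str, List[Posting]] = {
    "C": [("p1", "d1", 5, 5, 2), ("p2", "d7", 2, 1, 2), ("p2", "d9", 4, 2, 3)],
    "John": [("p1", "d1", 6, 4, 3)],
}
candidates = candidate_docs(index, ["C", "John"])
```

Only the surviving documents then need the full structural join, so far fewer postings travel over the WAN.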

Conclusion

AXML and distributed data management on the Web
- Opinion: XQuery is a language for local XML management
- Language for distributed query management: Active XML? What else?
- Foundation of distributed query optimization
  - Recent proposal: AXML + send/receive
- KadoP and P2P (Active) XML indexing
  - Now being tested; optimization work is under way
- ActiveXML is open source; KadoP soon will be, and is already available upon request
- Application: distribution of open-source software (with Mandriva)
  - Ongoing work

Other issues for turning the network into a scalable database
- Take an arbitrary problem for data or knowledge management and look at it in the P2P setting with gigabytes of data
- Examples
  - Self-tuning (joint work with Alkis Polyzotis)
  - Semantic integration (lots of work in Gemo)
  - Distributed access control (joint work with Bogdan Cautis)
  - Monitoring (joint work with DistribCom group in INRIA-Rennes)

Advertisement
- Launch of webContent
  - An RNTL platform
  - A Web data warehouse for monitoring
  - EADS, Thales, Bongrain, Xyleme, Exalead, NewPhoenix
- Looking for young engineers to work on webContent

Thank you

