Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11.

Similar presentations


Presentation on theme: "1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11."— Presentation transcript:

1 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

2 2 Introduction

3 3 P2P Data Management, 2006, S. Abiteboul3/89 Success stories at the time of the Internet bubble Google: management of Web pages Mapquest: management of maps Amazone: book catalogue eBay: product catalogue Napster (emule, bearshare, etc.): music database Flickr: picture database Wikipedia: dictionary del.icio.us: annotations In France: Meetic: dating database Kelkoo: comparative shopping They are all about publishing some database

4 4 P2P Data Management, 2006, S. Abiteboul4/89 The trend is towards peer-to-peer and interactivity P2P: A large and varying number of computers cooperate to solve some particular task without any centralized authority Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network seti@home; kazaa; cabal Switch from centralized servers to communities and syndication Interaction and Web 2.0 Motivations: Social, organizational

5 5 P2P Data Management, 2006, S. Abiteboul5/89 Information management in a P2P network Private terminology: data ring Information is heterogeneous, distributed, replicated, dynamic Which info: Data + meta-data + knowledge + services Peers are heterogeneous, autonomous and possibly mobile From sensors to PDA to mainframe Typically very large number of peers Variety of requirements: QoS, performance, security, etc.

6 6 P2P Data Management, 2006, S. Abiteboul6/89 Acknowledgement Xyleme: Scalable XML warehousing Sophie Cluet, Guy Ferran (Xyleme) & many others ActiveXML: Language for P2P data management Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv) & many others KadoP: P2P scalable XML indexing Ioana Manolescu, Nicoleta Preda & others Data Ring: Infrastructure for P2P data management Alkis Polyzotis (UC Santa Cruz)

7 7 P2P Data Management, 2006, S. Abiteboul7/89 Outline 1.Introduction – the data ring 2.Calculus for P2P data management(ActiveXML) 3.Algebra for P2P data management (ActiveXML algebra) 4.Indexing in P2P (KadoP) 5.Conclusion Goal of the tutorial: present issues and technology on p2p information management Warning: it is very biased – it is not a survey

8 8 P2P Data Management, 2006, S. Abiteboul8/89 Outline 1.Introduction – the data ring 2.Calculus for P2P data management(ActiveXML) 3.Algebra for P2P data management (ActiveXML algebra) 4.Indexing in P2P (KadoP) 5.Conclusion

9 9 1. Introduction – the data ring

10 10 P2P Data Management, 2006, S. Abiteboul10/89 The information in a data ring Data: tuples, collections, documents, relations… Services: data sources, possibly some processing Meta-data about resources: attribute/values pairs, annotations Ontologies to explain data and metadata View definitions Data integration information, e.g., mappings between ontologies Physical data: Indices and materialized views

11 11 P2P Data Management, 2006, S. Abiteboul11/89 Functionalities of the data ring Storage, persistence, replication Indexing, caching, querying, updating, optimization Schema management, access control Fault tolerance, self tuning, monitoring Resource discovery, history, provenance, annotations, multi-linguism, Semantic enrichment, uncertain data Each functionality may be achieved by a peer or by the network

12 12 P2P Data Management, 2006, S. Abiteboul12/89 And now, what is a peer? A mainframe database A file system Web server A PC A PDA A telephone A sensor A home appliance A car A manufacturing tool A telecom equipment A toy Another data ring Any connected device or software with some information to share A net address and some names of resources (e.g. document or service)

13 13 P2P Data Management, 2006, S. Abiteboul13/89 Advantages and disadvantages of P2P Scaling Performance Optimization of parallelism Avoid bottleneck Replication Availability Replication Cost Avoid the cost of server Share operational cost Dynamicity add/remove new data sources Complexity Performance Cost for complex queries Communication cost Availability Peers can leave Consistency maintenance Difficult to support transaction Quality Difficult to guarantee quality

14 14 P2P Data Management, 2006, S. Abiteboul14/89 Crash course on Web standards Data exchange format = XML Labeled ordered trees Its main asset = XML schema There is much more Distributed computing protocol = Web services SOAPSimple Object Access Protocols WSDL Web Service Definition Language UDDI Universal Description, Discovery and Integration BPEL Business Process Execution Language Query languages = XPath and XQuery Declarative query language for XML + full-text + update language Knowledge representation = Owl or RDF/S SOAP WSDL XML Xquery Xpath Owl RDFS

15 15 Information used to live in islands but with the Web, this is changing: uniform access to information… …the dream for distributed data management

16 16 P2P Data Management, 2006, S. Abiteboul16/89 Do you like the standards? It is the wrong question! Correct questions: What can you do with it? What is missing? Is Xquery the ultimate query language for the Web? No It is a language for querying centralized XML We will see what it is missing We will not talk much about semantics

17 17 P2P Data Management, 2006, S. Abiteboul17/89 Automatic and distributed management of the data ring No centralized server No information administrator (no info manager) Most users are non-experts E.g., scientists Requirements Ease of deployment (zero-effort) Ease of administration (zero-effort) Ease of publication (epsilon-effort) Ease of exploitation (epsilon-effort) Participation in community building notably via annotations Happy database admin

18 18 P2P Data Management, 2006, S. Abiteboul18/89 What should be made automatic Self-statistics from the monitoring of the data ring In particular, define the statistics that are needed* Self-tuning based on the self-statistics Choose the most appropriate organization Decide to install access structures: indexes, views, etc. Control replication of data and services Self-healing Recovery from errors E.g., replacement of a failing Web service And automatic file management

19 19 P2P Data Management, 2006, S. Abiteboul19/89 Any hope? Technology exists (database self-tuning, machine learning, etc.) But self-tuning for databases has advanced very slowly Why can this work? 1.There is no alternative (for db, this was just a cool gadget) 2.KISS (keep it simple stupid!) 3.The power of parallelism This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true)

20 20 P2P Data Management, 2006, S. Abiteboul20/89 Distributed access control Goal: Control access to ring resources Access to resources is based on access rights (ACL) Who is controlling ACLs? A node manages ACLs for a collection of distributed resources Easy but against the spirit and possible bottleneck The network manages access control Anybody can get the data The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes Some techniques exist

21 21 P2P Data Management, 2006, S. Abiteboul21/89 Monitoring What is monitored? Web service calls and database updates The Web –Web pages –RSS feed What is produced? A stream of events –As a continuous service –As a RSS feed –As a Web site/page Info-surveillance Self-statistics and tracing Basis for error diagnosis

22 22 P2P Data Management, 2006, S. Abiteboul22/89 Streams are everywhere In query processing In indexing (KadoP) In recursive queries (AXML-QSQ) In messaging, monitoring and pub/sub That is why we will use an algebra over streams of trees and not simply an algebra over trees

23 23 P2P Data Management, 2006, S. Abiteboul23/89 Example: Edos distribution system A system for the management of Linux distribution Joint work with Mandriva Software and U. Tel Aviv Community of open-source developers: thousands System releases: about 10 000 software packages + metadata Functionalities Query the metadata Query subscription Retrieve packages Publish a new release or update an existing one

24 24 P2P Data Management, 2006, S. Abiteboul24/89 Exemple: WebContent WebContent: an ANR platform for the management of web content Web surveillance Business, technical, web watching… Participation of Gemo WP3: knowledge WP5: P2P content management Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.)

25 25 P2P Data Management, 2006, S. Abiteboul25/89 Taxonomy of such applications Parameters Number of peers and quantity of data How volatile the peers are The query/update workload The functionalities that are desired Edos: peers and documents in thousands, mostly append for updates, peers not too volatile An extreme: Google search engine in P2P for billions of documents using millions of hyper volatile peers Mostly interested in the first case

26 26 P2P Data Management, 2006, S. Abiteboul26/89 Thesis XQuery is fine for local XML processing and publishing Not sufficient for distributed data management The success of the relational model, i.e., of tables on a server : 1. A logic for defining tables 2. An algebra for describing query plans over tables By analogy, we need for trees in a P2P system 1. A logic for defining distributed tree data and data services 2. An algebra for describing query plans over these Proposal: ActiveXML logic and algebra

27 27 P2P Data Management, 2006, S. Abiteboul27/89 Outline 1.Introduction – the data ring 2.Calculus for P2P data management(ActiveXML) 3.Algebra for P2P data management (ActiveXML algebra) 4.Indexing in P2P (KadoP) 5.Conclusion

28 28 2. Active XML: a logic for distributed data management

29 29 P2P Data Management, 2006, S. Abiteboul29/89 The basis AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework Simple idea: XML documents with embedded service calls Intensional data Some of the data is given explicitly whereas for some, its definition (i.e. the means to acquire it when needed) is given Dynamic data If the data sources change, the same document will provide different information

30 30 P2P Data Management, 2006, S. Abiteboul30/89 Example (omitting syntactic details) Aspen Unisys.com/snow(Aspen) 1 …. Yahoo.com/GetHotels( ) … May contain calls to any SOAP web service : e-bay.net, google.com… to any AXML web services to be defined

31 31 P2P Data Management, 2006, S. Abiteboul31/89 Marketing Philosophy Manon: Whats the capital of Brazil? Dad: Lets ask Wikipedia.com! Manon: How do I get a cheap ticket to Galapagos? Dad: Lets place a subscription on LastMinute.com! Manon: What are the countries in the EC? Dad: France, Germany, Holland, Belgium, and hum… Lets ask YouLists.com for more! Active answer = intensional and dynamic and flexible Embedding calls in data is an old idea in database

32 32 P2P Data Management, 2006, S. Abiteboul32/89 Active XML peer Peer-to-peer architecture Each Active XML peer Repository: manages Active XML data Web client: calls the services inside a document Web server: provides (parameterized) queries/updates over the repository as web services Exchange of AXML instead of XML AXML peer soap

33 33 P2P Data Management, 2006, S. Abiteboul33/89 What is an AXML peer? PC Now open source ObjectWeb – queries in OQL Peer on a mass storage system eXist (open source XML database) queries in XQuery Xyleme queries in XyQL PDA or cell phone Persistence in file system and XPATH On going: the entire network Data is stored in a P2P network - KadoP More: java card, a relational database…

34 34 P2P Data Management, 2006, S. Abiteboul34/89 A key issue: call activation When to activate the call? Explicit pull mode: active databases Implicit pull mode: deductive databases Push mode: query subscription What to do with its result? How long is the returned data valid? Mediation and caching Where to find the arguments? Under the service call: XML,XPATH or a service call

35 35 P2P Data Management, 2006, S. Abiteboul35/89 Hi John, what is the phone number of the Prime Minister of France? Find his name at whoswho.com then look in the phone dir Look in the yellow pages for deVillepins in phone dir of www.gov.fr (33) 01 56 00 01 Another key issue: what to send? Send some AXML tree t As result of a query or as parameter of a call The tree t contains calls, do we have to evaluate them? If I do, I may introduce service calls, do we have to evaluate all these calls before transmitting the data?

36 36 P2P Data Management, 2006, S. Abiteboul36/89 Active XML cool idea – complex problems Blasphemous claim: Active XML is the proper paradigm for data exchange! Not XML + not XQuery Brings to a unique setting distributed db, deductive db, active db, stream data warehousing, mediation This is unreasonable? Yes! Plenty of works ahead… to make it work But first, the algebra

37 37 P2P Data Management, 2006, S. Abiteboul37/89 Outline 1.Introduction – the data ring 2.Calculus for P2P data management(ActiveXML) 3.Algebra for P2P data management (ActiveXML algebra) Query processing Query optimization 4.Indexing in P2P (KadoP) 5.Conclusion

38 38 3. Active XML algebra

39 39 P2P Data Management, 2006, S. Abiteboul39/89 Motivation Relational model: centralized tables optimization: algebraic expression and rewriting Active XML model: distributed trees optimization: algebraic expression and rewriting Distributed query optimization based on algebraic rewriting of Active XML trees Based on experiences with AXML optimization

40 40 P2P Data Management, 2006, S. Abiteboul40/89 Active XML peers We focus on positive AXML Set-oriented data Positive/monotone services Services = tree-pattern-query-with-join queries Services produce streams Optimized by a local query optimizer Evaluated by a local query processor Out of our scope π join π output stream input stream input stream Local query processing

41 41 P2P Data Management, 2006, S. Abiteboul41/89 The problem An AXML system A set of peers For each peer a set of documents and services Extensional data is distributed Intensional data (knowledge) is distributed –Defined using query services (TPQJ queries) –These services are generic: any peer can evaluate a query A query q to some peer Evaluate the answer to q with optimal response time

42 42 P2P Data Management, 2006, S. Abiteboul42/89 (AXML) algebraic expressions: AXML algebra AXML logic l E1E2 En … s@p E1E2 En … d@p send@p #n@pE1 eval@p E1 receive@p E1 Each such expression lives at some peer Includes the AXML trees

43 43 P2P Data Management, 2006, S. Abiteboul43/89 Algebraic expressions annotations Executing service call: Terminated service call: Subtlety q@p(5): definition of intensional data eval(q@p(5)): request to evaluate it; during query optimization q@p(5): query is being evaluated; during query processing q@p(5): query evaluation is complete

44 44 P2P Data Management, 2006, S. Abiteboul44/89 Evaluation rules: local rules for l sc, s send, receive l t1t2 tn … eval@p l t1 eval@p t2 eval@p tn … tn s@p t1t2 … eval@p s@p eval@p t1 eval@p t2 eval@p tn … E1 eval@p E1 eval@p

45 45 P2P Data Management, 2006, S. Abiteboul45/89 Evaluation rules: transfer rules Site p asks p to do the work and send the result to p s@p t1t2 … eval@p #x@p s@p t1t2 … receive@p #x@p s@p t1t2 … eval@p newRoot()@p send@p #x@p &

46 46 P2P Data Management, 2006, S. Abiteboul46/89 Evaluation rules: more transfer rules When a query is evaluated, results appear They are sent to the place that requested them Also some rules for eof & #x@p Z … send@p eval@p s@p … #x@p Z … send@p eval@p s@p … #x@p Z …

47 47 P2P Data Management, 2006, S. Abiteboul47/89 Evaluation Reminder: setting An AXML system A request to evaluate query q at peer p – eval@p( q ) Rewrite the trees in peer workspaces until termination of the process Results For positive XML, this process converges… to a possibly infinite state This process computes the answer to q May be fairly inefficient: need for optimization!

48 48 Optimization More rewrite rules to evaluate a query more efficiently

49 49 P2P Data Management, 2006, S. Abiteboul49/89 Query optimization Well-known optimization techniques for distributed data management Pushing selections Semijoin reducers Horizontal, vertical, hybrid decomposition Recursive query processing and query-subquery Some specific AXML optimizations Pushing queries over service calls Lazy service call evaluation … Optimizing subscription management All are captured by the algebraic framework

50 50 P2P Data Management, 2006, S. Abiteboul50/89 Example: pushing selections Same rule applies if d@p2 is replaced by a continuous query q eval@p d@p2 q1@p eval@p @p2(q2@p2(d@p2)) q1 eval@p (q2(d@p2)) q1@p eval@p eval@p2 (q2(d)) Suppose q q1( (q2)):

51 51 P2P Data Management, 2006, S. Abiteboul51/89 Example: interleaving of processing and optimization At peer i: d i = r i d i+1 Query at p 1 : (d 1 ) 1. (d 1 ) (r 1 ) (d 2 ) eval@p 1 ( (r 1 ) (d 2 )) eval@p 1 ( (r 1 )) eval@p 1 ( (d 2 )) eval@p 1 ( (r 1 )) (r 1 ) (starts streaming data) 2. (d 2 ) (r 2 ) (d 3 ); (r 2 ) starts streaming data 3. (d 3 ) (r 3 ) (d 4 ) … (r1) (r2) (r3) (r4) …

52 52 P2P Data Management, 2006, S. Abiteboul52/89 Transfer and load balancing rules Peer p1 delegates the evaluation of E to p2 E eval@p1 #x@p1 eval@p2 send@p2 #x@p1 newRoot@p2() eval@p1 #x@p1 send@p1 E

53 53 P2P Data Management, 2006, S. Abiteboul53/89 Transfer and load balancing rules Peer p1 delegates the evaluation of E to p2 eval(E) #x@p1 send@p2 #x@p1 newRoot@p2() eval@p1 #x@p1 send@p1 eval@p2 E

54 54 P2P Data Management, 2006, S. Abiteboul54/89 Transfer and load balancing rules Peer p1 delegates the evaluation of E to p2 eval(E) #x@p1 newRoot@p2() #x@p1 send@p2 #x@p1 eval@p2 E &

55 55 P2P Data Management, 2006, S. Abiteboul55/89 Transfer and load balancing rules Peer p1 delegates the evaluation of E to p2 eval(E) #x@p1 newRoot@p2() #x@p1 send@p2 #x@p1 eval(E) &

56 56 P2P Data Management, 2006, S. Abiteboul56/89 Transfer and load balancing rules Peer p1 delegates the evaluation of E to p2 #x@p1 eval(E) newRoot@p2() #x@p1 send@p2 #x@p1 eval(E) &

57 57 P2P Data Management, 2006, S. Abiteboul57/89 Transfer and load balancing rules Peer p1 delegates the evaluation of E to p2 E eval@p1 #x@p1 eval@p2 send@p2 #x@p1 newRoot@p2() eval@p1 #x@p1 send@p1 E

58 58 P2P Data Management, 2006, S. Abiteboul58/89 Back to interleaved execution and optimization Data transfers reduced More work for p1: merging all the streams (r1) (r2) (r3) (r4) … (r1) (r2) (r3) (r4) … Hierarchical stream merging … … Repeated transfers

59 59 P2P Data Management, 2006, S. Abiteboul59/89 Example: Horizontal and vertical decomposition A relation d over ABC that is split both horizontally and vertically d = (d1 d2) d3 d1 = B =5 (d') d', d1, d2 over AB and d3 over BC; each di is at a peer pi Consider the query B=0 @p(d) B=0 @p( ( B =5 (d'))) d3@p3 ) B=0 @p( d1@p1 d3@p3 ) @p (#x@p receive(d1@p1), #y@p receive(d3@p3) ) & send@p1(#x@p; B=0 @p1(d1@p1) ) & send@p3(#y@p; d3@p3)

60 60 P2P Data Management, 2006, S. Abiteboul60/89 Common sub-expression elimination eval@p(E), #x@p receive@p(E) eval@p(#x@p), #x@p receive@p(E) eval@p E #x@p E receive@p eval@p #x@p E receive@p #x@p & &

61 61 P2P Data Management, 2006, S. Abiteboul61/89 In spite of the two calls to s@p, the function is evaluated only once Common sub-expression elimination eval@p q@p s@p q@p #x@p s@p receive@p s@p eval@p newRoot()@p send@p #x@p s@p & newRoot()@p send@p #y@p #x@p q@p #x@p receive@p s@p #y@p receive@p s@p newRoot()@p send@p #x@p s@p & &

62 62 P2P Data Management, 2006, S. Abiteboul62/89 Example: recursive query processing Using a pseudo Datalog syntax s1@p($x, $y) d2@p'($x, $z), s2@p'($z, $y) s2@p'($x, $y) d1@p($x, $z), s1@p($y, $z) After rewriting: on p : #x@p receive@p(q1@p'(d2@p', s2@p') ) root@p send@p(#y@p', q2@p( d1@p, #x@p) ) on p' : root@p' send@p'(#x@p, q1@p'(d2@p', #y@p' receive@p'(s2@p') ) )

63 63 P2P Data Management, 2006, S. Abiteboul63/89 q@any: where q is a query Any peer that has some query processor for q can do it f@any: where f is a processing service call Example: decryption or gene comparison q over a P2P collection Generic and global services eval@p q @ coll eval@p q @ index q eval@p q@p1 eval@p q@p2

64 64 P2P Data Management, 2006, S. Abiteboul64/89 The AXML algebra – conclusion Captures distributed XML query processing/optimization Based on a communication model a la CCS Algebraic – stream-oriented Orthogonal to the local XML query optimizer Orthogonal to the network support (DHT, small world etc.) What is not yet available? A cost model and heuristics

65 65 P2P Data Management, 2006, S. Abiteboul65/89 Outline 1.Introduction – the data ring 2.Calculus for P2P data management(ActiveXML) 3.Algebra for P2P data management (ActiveXML algebra) 4.Indexing in P2P (KadoP) 5.Conclusion

66 66 4. P2P XML indexing and query processing

67 67 P2P Data Management, 2006, S. Abiteboul67/89 Efficient evaluation of tree-pattern-queries Many optimization techniques We are interested here in distributed query evaluation/optimization 1) We consider XML indexing 2) Holistic twig join that is based on indexing 3) P2P indexing 4) P2P query processing 5) Optimizing P2P indexing

68 68 P2P Data Management, 2006, S. Abiteboul68/89 XML indexing: structural identifiers 1 2 3 5 6 7 8 4 6 6 6 8 8 8 X ancestor of Y pre(X) < pre(Y) and post(X) post(Y) X parent of Y X ancestor of Y and level(X) = level(Y) - 1 Structural IDs = Prefix-Postfix A B D E C F John G 0 1 1 2 2 2 3 4 4 3 -Level

69 69 P2P Data Management, 2006, S. Abiteboul69/89 Holistic Twig Join Input a document and a tree pattern query Find the bindings of the query in the document Holistic = holistique (le tout et pas juste les parties) Twig = brindille Join = you know… Sounds like Harry Potter?

70 70 P2P Data Management, 2006, S. Abiteboul70/89 Query evaluation over a document Ids for A (1,8,0)… Ids for D Ids for John John Ids for C Ids are sorted in lexicographical order Goals is to find matching Ids A D C

71 71 P2P Data Management, 2006, S. Abiteboul71/89 The Holistic Twig Join Algorithm a b c a (3,5 c (4,5) a (5,5) c (6,8) c (7,8) a (8,8) b (10,11) c (11,11) b (12,14) b (13,14) c (14,14) a (16,17) c (17,17) b (19,22) b (20,21) c (22,22) c (23,25) b (24,25) c (25,25) a (2,8) c (9,14)a (15,17) a (18,25) c (21,21) r (1,25) 0 1 2 3 4 level

72 72 P2P Data Management, 2006, S. Abiteboul72/89 Legend: c 11 c9c9 a c8c8 c7c7 a6a6 b3b3 c6c6 c4c4 The Holistic Twig Join Algorithm a1a1 a2a2 a3a3 a4a4 a5a5 a7a7 b1b1 b2b2 b4b4 b5b5 c1c1 c2c2 c3c3 c5c5 c9c9 b c (a 7, b 4, c 8 ), (a 7, b 5, c 8 ), ScSc SbSb SaSa a7a7 b4b4 c8c8 b6b6 b5b5 c 10 (a 7, b 4,c 9 ) b6b6 c 11 (a 7,b 6,c 11 ) Head of the stream Find the match for the query sub-tree determined by this node !!! The ID is present also in the stack Stacks This is the end

73 73 P2P XML processing

74 74 P2P Data Management, 2006, S. Abiteboul74/89 XML indexing in Xyleme History 1999: INRIA research project 2000: Creation of a spin-off 2006: About 25 people Technology A scalable XML repository A content warehouse On a cluster of Linux PC XML query processing Twig join Index is distributed Keyword-based vs. document based LAN Put(C;[d,p,6,6,1]) Put(John;[d,p,3,1,2]) hash(C) hash(John)

75 75 P2P Data Management, 2006, S. Abiteboul75/89 Query processing over a distributed collection Ids for A (p12,d456, 1,7,0)… Ids for D Ids for John A D C John Ids for C Ids include peerId and docId Ids are sorted in lexicographical order Goals is to find matching Ids in the collection

76 76 P2P Data Management, 2006, S. Abiteboul76/89 XML indexing in KadoP Use structural Ids Publish them via a DHT Distributed Hash Table Peers come and go Locate(k):: log(n) messages to fing the peer in charge of key k Put(k,v) Get(k): retrieves all the values for k We use Pastry We also tried P2PSim and JXTA DHT put(C;[d,p,6,6,1]) put(John;[d,p,3,1,2]) hash(C) hash(John) posting for C put(C;[d,p,6,6,1])

77 77 P2P Data Management, 2006, S. Abiteboul77/89 XML query processing in KadoP Given a tree pattern query Q Evaluate an index query indexQ to locate the peers that can provide some answers indexQ is a twig join Ship Q to these peers and evaluate it there If indexQ is imprecise, many false positive Example: ship Q to all peers (maximal parallelism) Example: Instead of structural Ids, just use (label/word,peerId,docId)

78 78 P2P Data Management, 2006, S. Abiteboul78/89 KadoP architecture KadoP peer publish & query KadoP Engine DHT locate, put, get & delete Index Indexing Query processing ActiveXML engine Web interface Semantic layer External Layer Logical Layer Physical Layer

79 79 P2P Data Management, 2006, S. Abiteboul79/89 Some technical issues Our goal: manage millions of documents with a large number of peers First experiments were a disaster Replace the index storage of the DHT in a FS by storage in a database (Berkeley DB) Extend the API of the DHT to have Append and not only Read/Write Extend the API of the DHT to have a streaming exchange of postings Useful because the XML algebra works better with streams Now it scales but there is the issue of long postings

80 80 P2P Data Management, 2006, S. Abiteboul80/89 The issue of long postings: Google in P2P Using keyword distribution Suppose Peer for Ullman is in Europe Peer for XML is in US we have to ship one long posting between US and Europe For a large number of users, we absorb all the bandwidth of Internet backbone Need for replication Even for thousands of peers, the exchange of long postings is an issue DHT Ullman xml Ullmann & xml?

81 81 P2P Data Management, 2006, S. Abiteboul81/89 Intensional indexing in KadoP long posting h(Name) Distributed B-tree Long posting = bad response time 1.No long posting 2.get h(name) then parallel fetch 3.Possibility to optimize further f(docId55..docId75) may be it does not match no need to call f h(Name) f g h i

82 82 P2P Data Management, 2006, S. Abiteboul82/89 More optimization Standard for P2P keyword search –Gap compression and adaptive set intersection Standard distributed query optimization techniques –Ship smallest list –Load balancing –Caching –Replication Semi-join techniques notably Bloom semi-join

83 83 P2P Data Management, 2006, S. Abiteboul83/89 Outline 1.Introduction – the data ring 2.Calculus for P2P data management(ActiveXML) 3.Algebra for P2P data management (ActiveXML algebra) 4.Indexing in P2P (KadoP) 5.Conclusion

84 84 6. Conclusion

85 85 P2P Data Management, 2006, S. Abiteboul85/89 Logic for distributed data management Opinion: XQuery is a language for local XML management Proposal: ActiveXML Algebraic foundation of distributed query optimization Proposal: ActiveXML algebra P2P (Active) XML indexing KadoP is now being tested and we are working on optimization Software ActiveXML is open-source – see activexml.net KadoP soon will be – already available upon request EDOS distribution system as well Conclusion

86 86 P2P Data Management, 2006, S. Abiteboul86/89 Lots of related work and related systems This is going very fast in system devepments Structured P2P nets: Pastry, Chord Content delivery net: Coral, Akamai XML repositories: Xyleme, DBMonet Multicas systemst: Avalanche, Bullet File sharing systems: BitTorrent, Kazaa Pub/Sub systems: Scribe, Hyper Distributed storage systems: OceanStore, GoogleFS Etc. Fundamental research is somewhat left behind

87 87 P2P Data Management, 2006, S. Abiteboul87/89 Issues P2P query optimization P2P access control P2P archiving P2P self tuning P2P monitoring P2P knowledge management [SomeWhere] Also: analysis and verification of these systems E.g., termination, error detection, diagnosis

88 88 P2P Data Management, 2006, S. Abiteboul88/89 Find your own topic Pick your favorite problem for data or knowledge management and study it in a P2P setting with gigabytes of data and thousands of peers If you find it boring, consider it with terabytes of data and millions of peers

89 89 P2P Data Management, 2006, S. Abiteboul89/89 Merci


Download ppt "1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11."

Similar presentations


Ads by Google