1 Scoped and Approximate Queries in a Relational Grid Information Service Dong Lu, Peter A. Dinda, Jason A. Skicewicz Prescience Lab, Dept. of Computer Science Northwestern University, Evanston, IL 60201
2 Outline Introduction and motivation –Powerful queries, but expensive to execute –Trade off between result size and query time Our solutions: Scoped query, Approximate query, Scoped Approximate query –Nondeterministic query (SC Talk on Tuesday) Performance Evaluation
3 What is RGIS? GIS: A Grid Information Service stores information about the resources and services in a distributed computing environment and answers queries about it. RGIS: A Grid Information Service based on the relational data model.
4 Why RGIS? 1. RGIS can answer complex compositional queries Relational algebra (SQL) Joins Difficult in a hierarchical model (directory service) 2. Other reasons Indexes separate from data model Schema evolution Transactional insert/update/delete Consistency
5 RGIS Model of a Grid module endpoint maclink macswitch iplink router host connectorswitch connectorlink Annotated network topology graph Annotation examples –Hosts: memory, disk, OS, NICs, etc. –Router/Switch: backplane bandwidth, ports –Link: latency and bandwidth Highly dynamic data in streams, not DB Virtualization, Futures, Leases –Virtual machines Network Data link Physical Software
7 Challenge/Trade off Complex queries to a relational database can take a long time. –Hours, days or even weeks when we want seconds. Typically, the returned result set is unnecessarily big. –Get back all results We need mechanisms to trade off query time against the size of the result set.
8 Challenge/Trade off All results Scoped results Nondeterministic results Approximate results
9 Example: Cluster Finder Cluster Routers IP links Hosts Find N hosts connected to the same router, with total memory of at least N*512 MB, all running Linux, such that the bisection bandwidth of the cluster is no less than 100 Mbits/sec.
10 Original SQL for 2 Host Cluster Finder SELECT [scoped-approx] h1.distip, h2.distip FROM hosts h1, hosts h2, iplinks l1, iplinks l2, routers r WHERE h1.mem_mb+h2.mem_mb>=1024 and h1.os='linux' and h2.os='linux' and ((l1.src=r.distip and l2.src=r.distip and l1.dest=h1.distip and l2.dest=h2.distip) or (l1.dest=r.distip and l2.dest=r.distip and l1.src=h1.distip and l2.src=h2.distip)) and h1.distip<>h2.distip and L1.BW_MBS >= 100 AND L2.BW_MBS >= 100 [SCOPED BY r.distip=X] WITHIN 100 seconds; Original
11 Original SQL for Cluster Finder Looking for an N-node cluster requires a (2*N+1)-way join: N hosts, N IP links and one router. Not scalable. Cluster 2 Routers IP links Hosts Cluster 1
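The join count can be made concrete by generating the unscoped query for an arbitrary N. The following is a minimal sketch, not the RGIS query rewriter: the function name `cluster_finder_sql` and its parameters are hypothetical, and it only mirrors the table and column names used in the 2-host query.

```python
def cluster_finder_sql(n, mem_per_host_mb=512, min_bw_mbs=100):
    """Build the unscoped cluster-finder query for n hosts.

    Hypothetical helper for illustration; returns (sql, number_of_joined_tables).
    """
    hosts = [f"h{i}" for i in range(1, n + 1)]
    links = [f"l{i}" for i in range(1, n + 1)]
    # n host aliases + n link aliases + 1 router alias = 2*n + 1 tables.
    tables = ([f"hosts {h}" for h in hosts]
              + [f"iplinks {l}" for l in links]
              + ["routers r"])

    # Total memory across all n hosts must reach n * mem_per_host_mb.
    conds = [" + ".join(f"{h}.mem_mb" for h in hosts)
             + f" >= {n * mem_per_host_mb}"]
    conds += [f"{h}.os = 'linux'" for h in hosts]
    conds += [f"{l}.bw_mbs >= {min_bw_mbs}" for l in links]

    # Each link joins the shared router to one host, in either direction.
    pairs = list(zip(hosts, links))
    down = " AND ".join(
        f"{l}.src = r.distip AND {l}.dest = {h}.distip" for h, l in pairs)
    up = " AND ".join(
        f"{l}.dest = r.distip AND {l}.src = {h}.distip" for h, l in pairs)
    conds.append(f"(({down}) OR ({up}))")

    # All hosts must be pairwise distinct.
    conds += [f"{a}.distip <> {b}.distip"
              for i, a in enumerate(hosts) for b in hosts[i + 1:]]

    sql = (f"SELECT {', '.join(h + '.distip' for h in hosts)} "
           f"FROM {', '.join(tables)} WHERE {' AND '.join(conds)}")
    return sql, len(tables)


sql_2, tables_2 = cluster_finder_sql(2)    # 5 tables, as in the 2-host query
sql_8, tables_8 = cluster_finder_sql(8)    # 17 tables
```

The number of joined tables grows linearly with N, and the pairwise-distinct conditions grow quadratically, which is why the unscoped form does not scale.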
12 Scoped Cluster Finder Routers IP links Hosts Query the hosts around a random router.
13 Scoped Cluster Finder SELECT H1.DISTIP, H2.DISTIP FROM HOSTS H1, HOSTS H2, IPLINKS L1, IPLINKS L2, ROUTERS R WHERE H1.MEM_MB+H2.MEM_MB>=1024 AND H1.OS='LINUX' AND H2.OS='LINUX' AND ((L1.SRC=R.DISTIP AND L2.SRC=R.DISTIP AND L1.DEST=H1.DISTIP AND L2.DEST=H2.DISTIP) OR (L1.DEST=R.DISTIP AND L2.DEST=R.DISTIP AND L1.SRC=H1.DISTIP AND L2.SRC=H2.DISTIP)) AND H1.DISTIP<>H2.DISTIP AND L1.BW_MBS >= 100 AND L2.BW_MBS >= 100 AND R.DISTIP = X; Scoped
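As a small runnable illustration of scoping, the sketch below runs the 2-host query with and without the router predicate. It uses Python's built-in sqlite3 rather than Oracle, and the four-host, two-router data set is invented purely for the example.

```python
import sqlite3

# The 2-host cluster finder, without the scoping predicate.
QUERY = """
SELECT h1.distip, h2.distip
FROM hosts h1, hosts h2, iplinks l1, iplinks l2, routers r
WHERE h1.mem_mb + h2.mem_mb >= 1024
  AND h1.os = 'linux' AND h2.os = 'linux'
  AND ((l1.src = r.distip AND l2.src = r.distip
        AND l1.dest = h1.distip AND l2.dest = h2.distip)
    OR (l1.dest = r.distip AND l2.dest = r.distip
        AND l1.src = h1.distip AND l2.src = h2.distip))
  AND h1.distip <> h2.distip
  AND l1.bw_mbs >= 100 AND l2.bw_mbs >= 100
"""


def make_db():
    """Create a toy in-memory grid (made-up data, for illustration only)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE routers (distip TEXT PRIMARY KEY);
        CREATE TABLE hosts (distip TEXT PRIMARY KEY, mem_mb INTEGER, os TEXT);
        CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
        INSERT INTO routers VALUES ('r1'), ('r2');
        INSERT INTO hosts VALUES ('a', 512, 'linux'), ('b', 512, 'linux'),
                                 ('c', 1024, 'linux'), ('d', 1024, 'linux');
        INSERT INTO iplinks VALUES ('r1', 'a', 100), ('r1', 'b', 100),
                                   ('r2', 'c', 100), ('r2', 'd', 100);
    """)
    return db


db = make_db()
all_pairs = set(db.execute(QUERY).fetchall())
# Scoping appends one equality predicate on the router (X = 'r2' here).
scoped = set(db.execute(QUERY + " AND r.distip = ?", ("r2",)).fetchall())
```

On this toy data the unscoped query returns host pairs under both routers, while the scoped query returns only the pairs under `r2`, a subset of the full result.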
14 Approximate Cluster Finder When searching for N hosts with total memory N*512 MB, we can approximate the query with “search for N hosts, each having at least 512 MB of memory”. This reduces the number of joins or avoids them altogether. However, this won’t find, say, N/2 hosts with 256 MB and N/2 hosts with 768 MB.
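The difference between the two conditions can be checked directly. This is a minimal sketch with hypothetical function names; it compares the original total-memory condition with the per-host approximation on the 256 MB / 768 MB case.

```python
def exact_ok(mems, n, per_host_mb=512):
    """Original condition: n hosts whose total memory is >= n * per_host_mb."""
    return len(mems) == n and sum(mems) >= n * per_host_mb


def approx_ok(mems, n, per_host_mb=512):
    """Approximation: n hosts that each have >= per_host_mb of memory."""
    return len(mems) == n and all(m >= per_host_mb for m in mems)


uniform = [512, 512]   # satisfies both conditions
mixed = [256, 768]     # total is 1024 MB, but one host is below 512 MB

uniform_result = (exact_ok(uniform, 2), approx_ok(uniform, 2))
mixed_result = (exact_ok(mixed, 2), approx_ok(mixed, 2))
```

Any set of hosts accepted by the approximation also satisfies the original condition, so the approximation can miss valid clusters but never reports an invalid one.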
15 Approximate Cluster Finder SELECT R.DISTIP, H1.DISTIP FROM HOSTS H1, IPLINKS L1, ROUTERS R WHERE H1.MEM_MB>=512 AND H1.OS='LINUX' AND L1.BW_MBS >= 100 AND ((L1.SRC=R.DISTIP AND L1.DEST=H1.DISTIP) OR (L1.DEST = R.DISTIP AND L1.SRC=H1.DISTIP)) AND R.DISTIP IN (SELECT R.DISTIP FROM HOSTS H1, IPLINKS L1, ROUTERS R WHERE H1.MEM_MB>=512 AND H1.OS='LINUX' AND L1.BW_MBS>=100 AND ((L1.SRC=R.DISTIP AND L1.DEST=H1.DISTIP) OR (L1.DEST = R.DISTIP AND L1.SRC=H1.DISTIP)) GROUP BY R.DISTIP HAVING COUNT(*) >= 2) ORDER BY R.DISTIP;
16 Scoped Approximate Cluster Finder Combine approximate query with scoped query. Scope the query to one randomly chosen router at a time; if no results are found, choose another random router and repeat the query. Approximate the N-host join for 512*N MB of total memory with a search for N hosts, each with >=512 MB. Always a THREE way join. –regardless of the size of the cluster being searched for. Thus very scalable. –may need to search multiple routers.
17 Scoped Approximate Cluster Finder SELECT H1.DISTIP FROM HOSTS H1, IPLINKS L1, ROUTERS R WHERE H1.MEM_MB>=512 AND H1.OS='LINUX' AND L1.BW_MBS >= 100 AND ((L1.SRC=R.DISTIP AND L1.DEST=H1.DISTIP) OR (L1.DEST = R.DISTIP AND L1.SRC=H1.DISTIP)) AND R.DISTIP=X AND ROWNUM <=2 The scoped approximate cluster finder has a fixed number of joins.
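The retry loop around this query might look like the sketch below. It is an illustration under stated assumptions, not the RGIS implementation: it uses sqlite3, so Oracle's `ROWNUM <= N` becomes `LIMIT`; it visits routers in shuffled order without repeats, which is one possible reading of "choose another random router"; and `find_cluster` and the toy data are hypothetical.

```python
import random
import sqlite3

# Fixed three-way join; only the router X and the row limit N change.
SCOPED_APPROX = """
SELECT h1.distip
FROM hosts h1, iplinks l1, routers r
WHERE h1.mem_mb >= 512 AND h1.os = 'linux' AND l1.bw_mbs >= 100
  AND ((l1.src = r.distip AND l1.dest = h1.distip)
    OR (l1.dest = r.distip AND l1.src = h1.distip))
  AND r.distip = ?
LIMIT ?
"""


def find_cluster(db, n, rng=random):
    """Try routers in random order; return (router, hosts) or None."""
    routers = [row[0] for row in db.execute("SELECT distip FROM routers")]
    rng.shuffle(routers)  # assumption: sample routers without replacement
    for router in routers:
        hosts = [row[0] for row in db.execute(SCOPED_APPROX, (router, n))]
        if len(hosts) == n:
            return router, hosts
    return None  # no single router has n qualifying hosts


# Toy grid (made-up data): r1 has one qualifying host, r2 has two.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE routers (distip TEXT PRIMARY KEY);
    CREATE TABLE hosts (distip TEXT PRIMARY KEY, mem_mb INTEGER, os TEXT);
    CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
    INSERT INTO routers VALUES ('r1'), ('r2');
    INSERT INTO hosts VALUES ('a', 256, 'linux'), ('b', 512, 'linux'),
                             ('c', 512, 'linux'), ('d', 1024, 'linux'),
                             ('e', 2048, 'windows');
    INSERT INTO iplinks VALUES ('r1', 'a', 100), ('r1', 'b', 100),
                               ('r2', 'c', 100), ('d', 'r2', 100),
                               ('r2', 'e', 100);
""")

found = find_cluster(db, 2)
not_found = find_cluster(db, 3)
```

Because the join stays three-way for any N, the cost per attempt does not grow with cluster size; the price is that several routers may have to be tried.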
18 Time bounded queries The query rewriter will start the query as a child process. The parent kills the child process if no results are returned within the deadline.
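The parent/child deadline pattern can be sketched with Python's standard `subprocess` module. This is an assumption-laden illustration rather than the RGIS query rewriter: the child here is a short Python snippet standing in for a database query, and `run_with_deadline` is a hypothetical name.

```python
import subprocess
import sys


def run_with_deadline(child_code, deadline_s):
    """Run child_code in a child Python process.

    Returns the child's stdout, or None if the deadline passes first.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            timeout=deadline_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return None
    return done.stdout


# A child that answers quickly, and one that would overrun its deadline.
fast = run_with_deadline("print('host-1')", deadline_s=30)
slow = run_with_deadline("import time; time.sleep(30)", deadline_s=0.5)
```

A caller treats `None` as "no results within the deadline", which corresponds to the `WITHIN 100 seconds` clause in the extended query syntax.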
19 Limitations of Scoped and Approximate queries The returned results are a subset of those of the original query, and it is possible to report no results even though the original query would return results after running for a long time. Not all queries can be written as Scoped or Approximate queries. It is hard to automate the Scoped and Approximate query rewriting.
20 Performance Evaluation Need to populate the database with a large amount of data. Computational grids are still in early stages. –No large data sets available. –Use Smith MDS data for memory We generate synthetic grids that are representative of the Internet. –Can generate very large grids
21 GridG Generated Synthetic Grids Three-level network: WAN, MAN, LAN. Nodes on WAN, MAN are routers, while nodes on LAN are hosts. Links: IP links annotated with bandwidth and latency. Hosts: annotated with memory size, architecture, number of processors, CPU clock rate, disk size, etc. User can control all the distributions and the size of network.
23 Experimental Setup Dell PowerEdge 4400: dual Xeon 1 GHz processors, 2 GB memory, 240 GB RAID 5 storage system. Oracle 9i Enterprise Edition, Red Hat Linux 7.1. Each test is repeated either 25 or 100 times, and we report the average value.
25 Performance of Scoped Approximate Queries Cluster Finder : Find N hosts, each running Linux, with total memory at least N*512 MB, all connected to the same router, where the bisection bandwidth is at least 100 Mbits/sec. –Our running example Non network query : Find N hosts with total memory at least N*512 MB. –No joins needed at all
26 Performance of Scoped Approximate Queries (2) Scalability with database size. Scalability with the complexity of queries. Scalability with concurrent users and update load.
33 Scalability with multiple concurrent users and background load Other research has shown that GIS servers will undergo frequent updates while serving requests. GIS servers serve multiple concurrent users. Evaluate scoped approximate queries with concurrent users and update load. Concurrent users: execute queries repeatedly. The update load: execute transactional updates on randomly selected hosts as fast as possible. –About 200 updates/second
34 Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder, with Concurrent Users, looking for 64 nodes)
35 Performance of Scoped Approximate Query (9.8K hosts, Non network query, with Concurrent Users, looking for 64 nodes)
36 Conclusions Described and evaluated two query techniques to trade off query time with the size of result set: Scoped and Approximate query. Combination of Scoped and Approximate query can dramatically reduce response time and server load.
37 For more information GridG and Related paper: http://www.cs.northwestern.edu/~urgis/GridG “Synthesizing Realistic Computational Grids”, In proceedings of SC03. RGIS and Related paper: http://www.cs.northwestern.edu/~urgis/ “Nondeterministic Queries in a Relational Grid Information Service”, In proceedings of SC03.