Indexing HDFS Data in PDW: Splitting the data from index

Vinitha Gankidi (University of Wisconsin-Madison), Nikhil Teletia (Microsoft Jim Gray Systems Lab), Jignesh M. Patel (University of Wisconsin-Madison), Alan Halverson (Microsoft Jim Gray Systems Lab), David J. DeWitt (Microsoft Jim Gray Systems Lab)
Motivation

Hybrid SQL-on-Hadoop solutions (Microsoft PolyBase, Teradata QueryGrid, IBM Big SQL, etc.)

Data lives in two worlds:
HDFS: cheap and scalable data store; load first, schema later; cold data.
RDBMS: familiar SQL interface; decades of research and optimization; hot data.

SQL Server PDW with PolyBase: SQL in, result out.

Is it possible to run highly selective queries on HDFS-resident cold data with both low latency and minimal system changes?
Query Execution over External Data

Example query: SELECT * FROM hdfs_Employee WHERE DeptID = 1

Each HDFS file holds rows such as:

ID   Name  DeptID
101  A     1
102  B     2
103  C     3

Import path:
1. Import the HDFS files into PDW.
2. Run the rest of the query inside PDW.
Drawback: the HDFS files have to be imported in their entirety.

Push-down path:
1. Run a Map job to filter.
2. Import the result of the Map job into PDW.
3. Run the rest of the query inside PDW.
Drawbacks: significant startup overhead for the Map task; all the HDFS files are scanned in their entirety.

Can we execute queries without entirely scanning all the HDFS files and without running a Map job? Yes, by using an index.
What is a Split-Index?

1. The index is stored in the RDBMS, while the data is in HDFS.
2. The index is stored as an RDBMS table:
   hash-partitioned across multiple nodes;
   each partition has a clustered B+ tree.

Data in HDFS:

ID   Name  DeptID
101  A     1
102  B     2
103  C     3

Index in the RDBMS:

DeptID  HDFS File Name  HDFS Offset  Rec Len
1       file1           0            10

A Split-Index is similar to a materialized view (with an external pointer).
A Split-Index can be out of sync with the data.
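The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the PDW implementation: the function name `build_split_index`, the comma-delimited record format, the modulo hash, and the use of a sorted list in place of a clustered B+ tree are all assumptions made for the example.

```python
from collections import defaultdict

def build_split_index(file_name, data, key_col, num_partitions=2):
    """Build index rows (key, file_name, offset, rec_len) for one file.

    Hypothetical sketch: `data` is the raw bytes of a newline-delimited,
    comma-separated file. Rows are hash-partitioned on the key, and each
    partition is kept sorted by key as a stand-in for a clustered B+ tree.
    """
    partitions = defaultdict(list)
    offset = 0
    for line in data.splitlines(keepends=True):
        key = int(line.rstrip(b"\n").split(b",")[key_col])
        # The index stores only a pointer into the file, not the record itself.
        partitions[key % num_partitions].append((key, file_name, offset, len(line)))
        offset += len(line)
    return {p: sorted(rows) for p, rows in partitions.items()}

employee_bytes = b"101,A,1\n102,B,2\n103,C,3\n"
index = build_split_index("file1", employee_bytes, key_col=2)
```

Each index row is small compared with the record it points to, so the index can sit in the RDBMS while the bulk of the data stays in HDFS.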
Query Execution using Split-Index

1. The query arrives at the RDBMS:
   SELECT name FROM hdfs_Employee WHERE DeptID = 1

2. The RDBMS probes the index, Index_Emp (index on DeptID):
   SELECT [HDFS File Name], [HDFS Offset], [Rec Len] FROM index_Employee WHERE DeptID = 1

   Index partition contents:

   DeptID  HDFS File Name  HDFS Offset  Rec Len
   1       file1           0            10
   2       file1           10           10
   3       file1           20           10

   Probe result:

   HDFS File Name  HDFS Offset  Rec Len
   file1           0            10
   …

3. Retrieve the qualifying tuples from the HDFS files.

   Name
   A
   …

4. Return the result.

Data in HDFS:

ID   Name  DeptID
101  A     1
102  B     2
103  C     3

Using the index, we can answer queries without having to sequentially scan each HDFS file.
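The probe-then-fetch flow above can be sketched as follows. This is an assumption-laden illustration: a local temporary directory stands in for HDFS, a Python list stands in for the index table, and the helper names (`probe_index`, `fetch_records`, `run_query`, `demo`) are hypothetical.

```python
import os
import tempfile

def probe_index(index_rows, dept_id):
    """Step 2: return (file, offset, length) pointers for matching keys."""
    return [(f, off, ln) for key, f, off, ln in index_rows if key == dept_id]

def fetch_records(pointers, root):
    """Step 3: seek to each pointer and read exactly one record."""
    records = []
    for file_name, offset, length in pointers:
        with open(os.path.join(root, file_name), "rb") as fh:
            fh.seek(offset)
            records.append(fh.read(length).decode().rstrip("\n").split(","))
    return records

def run_query(index_rows, root, dept_id):
    """Steps 2 to 4 for: SELECT name FROM hdfs_Employee WHERE DeptID = ?"""
    return [rec[1] for rec in fetch_records(probe_index(index_rows, dept_id), root)]

def demo():
    # A local directory plays the role of HDFS in this sketch.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "file1"), "wb") as fh:
            fh.write(b"101,A,1\n102,B,2\n103,C,3\n")
        index_rows = [(1, "file1", 0, 8), (2, "file1", 8, 8), (3, "file1", 16, 8)]
        return run_query(index_rows, root, dept_id=1)
```

Only the bytes named by the index pointers are read; the rest of the file is never scanned.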
Incremental Index Update

Given the append-only property of HDFS data, the index can be updated incrementally.

A new HDFS file is added:
   append the rows of the new file to the existing index.

An HDFS file is deleted:
   delete the rows of the deleted file from the existing index.
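A minimal sketch of the two maintenance operations, assuming index rows of the form (key, file name, offset, record length) and comma-delimited records. The function names are hypothetical and the index is a plain sorted list rather than an RDBMS table.

```python
def index_file(file_name, data, key_col):
    """Produce index rows for a single newline-delimited file (raw bytes)."""
    rows, offset = [], 0
    for line in data.splitlines(keepends=True):
        key = int(line.rstrip(b"\n").split(b",")[key_col])
        rows.append((key, file_name, offset, len(line)))
        offset += len(line)
    return rows

def apply_file_added(index_rows, file_name, data, key_col):
    """A new file arrives: index only that file and merge its rows in."""
    return sorted(index_rows + index_file(file_name, data, key_col))

def apply_file_deleted(index_rows, file_name):
    """A file is removed: drop only the rows that point into it."""
    return [row for row in index_rows if row[1] != file_name]

idx = apply_file_added([], "file1", b"101,A,1\n102,B,2\n", key_col=2)
idx = apply_file_added(idx, "file2", b"104,D,1\n", key_col=2)
after_delete = apply_file_deleted(idx, "file1")
```

Neither operation touches rows belonging to other files, which is what makes the update incremental: existing files never change, so their index rows never need to be recomputed.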
Hybrid Scan

A stale Split-Index can still be used during query execution. Examples:

An HDFS file is added:
   scan the new file using the non-index approach;
   process the existing files using the index.

An HDFS file is deleted:
   when probing the index, remove the rows associated with the deleted file.
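The hybrid scan can be sketched as below. This is an illustrative assumption, not the PDW code: `current_files` is an in-memory dict standing in for the files currently in HDFS, the full scan of new files is a simple loop rather than a Map job or import, and the name `hybrid_scan` is hypothetical.

```python
def hybrid_scan(index_rows, current_files, key_col, key):
    """Answer an equality predicate using a possibly stale index.

    index_rows: (key, file_name, offset, rec_len) tuples, possibly stale.
    current_files: {file_name: raw bytes} for files that exist right now.
    """
    indexed_files = {row[1] for row in index_rows}
    results = []

    # Index path: skip pointers into files that have since been deleted.
    for k, file_name, offset, length in index_rows:
        if k == key and file_name in current_files:
            record = current_files[file_name][offset:offset + length]
            results.append(record.decode().rstrip("\n"))

    # Non-index path: fully scan files added after the index was built.
    for file_name in sorted(set(current_files) - indexed_files):
        for line in current_files[file_name].decode().splitlines():
            if int(line.split(",")[key_col]) == key:
                results.append(line)
    return results

stale_index = [(1, "file1", 0, 8), (2, "file1", 8, 8), (1, "file2", 0, 8)]
files_now = {
    "file1": b"101,A,1\n102,B,2\n",   # still present and indexed
    "file3": b"105,E,1\n106,F,3\n",   # added after the index was built
}                                     # file2 has been deleted
matches = hybrid_scan(stale_index, files_now, key_col=2, key=1)
```

Here the index still lists `file2` and knows nothing about `file3`, yet the result is correct because deleted files are filtered out of the probe and new files are scanned directly.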
Split-Index Performance

Cluster:
   9-node SQL Server PDW cluster (8 compute nodes + 1 control node)
   29-node Windows HDP 2.0 cluster (28 data nodes + 1 name node)

Data set:
   10 TB scale Lineitem table

Compare the Push-Down approach with the Split-Index approach.

Cost components:
   Push-Down approach: Map cost + data import cost
   Split-Index approach: RID materialize cost + data import cost
Split-Index Performance

SELECT * FROM lineitem WHERE l_orderkey <= [Variable]

Data size: ~800 GB
Index size: ~80 GB
Split-Index on l_orderkey

[Chart: cost of the Push-Down approach (Map cost + data import cost) compared with the Split-Index approach (RID materialize cost + data import cost)]

Index performance is sensitive to the access pattern.
Space vs. Time Trade-off

The cost of storing data in the RDBMS is higher than in HDFS.
A Split-Index can be used as a covering index.
Quantify the performance and space trade-off as we move columns from HDFS to PDW.

Experiment setup:
   1 TB scale Lineitem
   Modified Query 6

SELECT SUM(l_extendedprice*l_discount) AS REVENUE
FROM lineitem
WHERE l_shipdate >= ' '
  AND l_shipdate < dateadd(mm, 1, cast(' ' as date))
  AND l_discount BETWEEN AND
  AND l_quantity < 24
Space vs. Time Trade-off

Configurations compared:
   The Lineitem table is in HDFS. No index (Push-Down).
   The Lineitem table is in HDFS. Split-Index on l_shipdate.
   The Lineitem table is in HDFS. Split-Index on l_shipdate, l_discount, l_quantity, l_extendedprice.
   The Lineitem table is in PDW. No index.

A Split-Index can be used to balance the query execution time and the PDW disk footprint.
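To illustrate why the four-column Split-Index acts as a covering index for the modified Query 6, here is a small sketch in which the aggregate is computed from index rows alone, with no read from HDFS. The row layout, the function name `revenue_from_covering_index`, and every date, discount, and price value below are hypothetical; the literals used in the actual experiment are not given in the slides.

```python
from datetime import date
from decimal import Decimal

def revenue_from_covering_index(index_rows, start, end, disc_lo, disc_hi, max_qty):
    """Compute SUM(l_extendedprice * l_discount) from index rows only.

    index_rows: (l_shipdate, l_discount, l_quantity, l_extendedprice) tuples.
    Every column the query needs is in the index, so no HDFS pointer is followed.
    """
    total = Decimal("0")
    for shipdate, discount, quantity, extendedprice in index_rows:
        if (start <= shipdate < end
                and disc_lo <= discount <= disc_hi
                and quantity < max_qty):
            total += extendedprice * discount
    return total

# Hypothetical rows and predicate values, chosen only for illustration.
rows = [
    (date(2000, 1, 5), Decimal("0.05"), 10, Decimal("100.00")),   # qualifies
    (date(2000, 1, 20), Decimal("0.10"), 10, Decimal("200.00")),  # discount out of range
    (date(2000, 2, 1), Decimal("0.05"), 5, Decimal("300.00")),    # ship date out of range
    (date(2000, 1, 10), Decimal("0.06"), 30, Decimal("400.00")),  # quantity too large
]
revenue = revenue_from_covering_index(
    rows, date(2000, 1, 1), date(2000, 2, 1),
    Decimal("0.04"), Decimal("0.06"), 24,
)
```

The price of skipping HDFS is that four columns of every Lineitem row now occupy space in PDW, which is the disk-footprint side of the trade-off.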
Conclusions and Future Work

A simple "Split-Index" mechanism can be used to achieve low latency on highly selective queries, with minimal system changes.
Incremental index update reduces the cost of maintaining the Split-Index; hybrid scan allows a stale Split-Index to be used.

Future work: query optimization to use the Split-Index, and an automatic physical schema designer for the Split-Index(es).