Published by Hope Gragg
Modified over 2 years ago
© Hortonworks Inc
MapReduce over snapshots
HBASE-8369
Enis Soztutar
Enis [at] apache [dot]

About Me
In the Hadoop space since 2007
Committer and PMC Member in Apache HBase and Hadoop
Working at Hortonworks as a Member of Technical Staff
Architecting the Future of Big Data

Snapshots
Currently, a snapshot is a set of reference files together with some metadata
A table snapshot can contain:
  – Table descriptor
  – List of regions
  – References to files in the regions
  – References to WALs for regionservers
The current snapshot implementation is flush-based
  – Forces a flush on all regions, so that in-memory data is written to disk

MR over Snapshots
The idea is to do scans on the client side, bypassing region servers
Use snapshots since they are immutable
Similar to short-circuit HDFS reads
TableSnapshotInputFormat works similarly to TableInputFormat
TableMapReduceUtil methods configure the job

Deployment Options
HBase online
  – Take snapshot while HBase is running
  – Run MR job over the snapshot
HBase offline
  – Take snapshot while HBase is running
  – Export the snapshot to a different HDFS using ExportSnapshot
  – Run MR job over the snapshot with or without HBase running

TableSnapshotInputFormat
Gets a Scan representing the query
Restores the snapshot to a temporary directory
For each region in the snapshot:
  – Determine whether the region should be scanned (falls between scan start row and stop row)
  – Create one split per region in the scan range (# of map tasks)
  – Each RecordReader will open the region (HRegion) as in HRegionServer
  – An internal RegionScanner is used for running the scan

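The region-selection step on this slide can be illustrated with a small stand-alone sketch. It is not the HBase implementation: `SnapshotSplitSketch`, `RegionRange`, `overlaps` and `splitsFor` are hypothetical names, and the sketch assumes region ranges and scan ranges are both half-open `[start, stop)` with an empty key meaning "unbounded", compared as unsigned bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of "one split per region in the scan range".
// None of these types are HBase classes.
final class SnapshotSplitSketch {

    /** Stand-in for a region's row-key range [startKey, endKey). */
    static final class RegionRange {
        final String name;
        final byte[] startKey; // empty array means "start of table"
        final byte[] endKey;   // empty array means "end of table"

        RegionRange(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey.getBytes(StandardCharsets.UTF_8);
            this.endKey = endKey.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** True if the region's key range intersects the scan's [scanStart, scanStop). */
    static boolean overlaps(RegionRange region, byte[] scanStart, byte[] scanStop) {
        // The region must begin before the scan ends (an empty stop row is unbounded).
        boolean regionStartsBeforeScanStop =
            scanStop.length == 0 || Arrays.compareUnsigned(region.startKey, scanStop) < 0;
        // The scan must begin before the region ends (an empty end key is unbounded).
        boolean scanStartsBeforeRegionEnd =
            region.endKey.length == 0 || Arrays.compareUnsigned(scanStart, region.endKey) < 0;
        return regionStartsBeforeScanStop && scanStartsBeforeRegionEnd;
    }

    /** Returns the names of the regions that would each become one split (one map task). */
    static List<String> splitsFor(List<RegionRange> regions, String startRow, String stopRow) {
        byte[] start = startRow.getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRow.getBytes(StandardCharsets.UTF_8);
        List<String> splits = new ArrayList<>();
        for (RegionRange region : regions) {
            if (overlaps(region, start, stop)) {
                splits.add(region.name);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<RegionRange> regions = List.of(
            new RegionRange("r1", "", "g"),
            new RegionRange("r2", "g", "p"),
            new RegionRange("r3", "p", ""));
        // A bounded scan touches only the regions whose ranges intersect it.
        System.out.println(splitsFor(regions, "h", "k"));
        // An unbounded scan (empty start and stop rows) touches every region.
        System.out.println(splitsFor(regions, "", ""));
    }
}
```

In this model the number of map tasks equals the size of the returned list, which matches the slide's "one split per region in the scan range".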
API

Timeline
Will (hopefully) be committed to trunk next week or so
Interest in bringing this to the 0.94 and 0.96 code bases as well
Will come in HDP-2.1, which will be based on the 0.96 line

Security Aspects
The HBase user owns the files in the filesystem
Snapshot files are also owned by the HBase user
The MapReduce job should be able to read the files in the snapshot + actual data files
HDFS only has POSIX-like permissions based on user/group/other
  – The user running the MR job has to be either the HBase user, or have group permissions
  – HDFS does not have ACLs, so there is no easy way to grant read access at the filesystem layer
Idea: similar to the current short-circuit implementation, we can implement an FD transfer
  – User will submit jobs under her own user credentials
  – Ask HBase daemons to open the files, and pass a handle / token

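The user/group/other constraint described on this slide can be made concrete with a small model of a POSIX-style read check. This is only a sketch under stated assumptions: `PosixReadCheckSketch` and `canRead` are hypothetical names, not HDFS or HBase APIs, the mode `0750` and the user and group names are made up for illustration, and superuser handling is left out.

```java
import java.util.Set;

// Hypothetical model of a POSIX-style read check; not an HDFS API.
final class PosixReadCheckSketch {

    /**
     * Decides read access from the usual octal mode (for example 0750).
     * Exactly one permission class applies: owner first, then group, then other.
     */
    static boolean canRead(String user, Set<String> userGroups,
                           String fileOwner, String fileGroup, int mode) {
        if (user.equals(fileOwner)) {
            return (mode & 0400) != 0; // owner read bit
        }
        if (userGroups.contains(fileGroup)) {
            return (mode & 0040) != 0; // group read bit
        }
        return (mode & 0004) != 0;     // other read bit
    }

    public static void main(String[] args) {
        int mode = 0750; // assumed mode: rwx for owner, r-x for group, nothing for other
        // The owning user can read.
        System.out.println(canRead("hbase", Set.of("hbase"), "hbase", "hbase", mode));
        // A user in the file's group can read.
        System.out.println(canRead("alice", Set.of("users", "hbase"), "hbase", "hbase", mode));
        // Any other user cannot, and without ACLs there is no per-user exception.
        System.out.println(canRead("bob", Set.of("users"), "hbase", "hbase", mode));
    }
}
```

Under this model, a job submitted by a user outside the owning group can only read the snapshot files if the "other" read bit is set for everyone, which is the gap the FD-transfer idea is meant to close.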
Performance
ScanTest:
  – Scan: open a scanner, do full table scan
  – SnapshotScan: open a client-side scanner, do full table scan
  – ScanMR: parallel full table scan from MR
  – SnapshotScanMR: do full table scan
8 region servers, 6 disks each
HBase trunk
Hadoop-2.2 (HDP )
Load data with IntegrationTestBulkLoad
  – Evenly distributed rows, created as bulk-loaded HFiles
3 column families
# store files per region varies: 3, 6, 9, and 12 (1, 2, 3, 4 files per store)
Data sizes: 6.6G, 13.2G, 19.8G, 26.4G

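The four test configurations follow from simple arithmetic, sketched below. The class and method names are hypothetical. The sketch relies on each column family having one store per region, and it assumes the data size grows linearly with the number of files per store, which is consistent with the listed sizes being 1x, 2x, 3x and 4x of 6.6G. Sizes are kept in tenths of a gigabyte to avoid floating-point rounding.

```java
// Hypothetical helper that reproduces the configuration numbers on the slide.
final class ScanTestSizesSketch {

    static final int COLUMN_FAMILIES = 3;         // from the slide
    static final int BASE_SIZE_TENTHS_OF_GB = 66; // 6.6G at one file per store

    /** One store per column family, so files per region = families x files per store. */
    static int storeFilesPerRegion(int filesPerStore) {
        return COLUMN_FAMILIES * filesPerStore;
    }

    /** Assumes the data size scales linearly with files per store. */
    static int dataSizeTenthsOfGb(int filesPerStore) {
        return BASE_SIZE_TENTHS_OF_GB * filesPerStore;
    }

    public static void main(String[] args) {
        for (int filesPerStore = 1; filesPerStore <= 4; filesPerStore++) {
            int tenths = dataSizeTenthsOfGb(filesPerStore);
            System.out.println(filesPerStore + " file(s) per store: "
                + storeFilesPerRegion(filesPerStore) + " store files per region, "
                + (tenths / 10) + "." + (tenths % 10) + "G");
        }
    }
}
```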
Scan speed

API
We do not want to limit snapshot scanning to MapReduce only
Allow client-side scanners over snapshot files

ResultScanner is the main scan API

API (caution: not final yet)

To the future and beyond
HBASE-8691 High-Throughput Streaming Scan API
Can we bypass regionservers without taking snapshots?
Bypass memstore data, or stream memstore data, but read directly from HFiles
Secure reading from snapshots
Keep up with the updates at
  – https://issues.apache.org/jira/browse/HBASE-8369

Thanks
Questions?
Enis Söztutar
enis [ at ] apache [dot]