Spatial Tajo: Supporting Spatial Queries on Apache Tajo
Slideshare short URL: goo.gl/j0VLXp



Contents
• What is Spatial Tajo?
• Motive for Development
• Why I chose Apache Tajo?
• Plan for the implementation of the plug-in
• Current status
  ◦ Parts implemented
  ◦ Parts not yet implemented
• Conclusion
• References

What is Spatial Tajo?
• A plug-in that provides spatial queries for Tajo
• More precisely, it lets users run spatial-relation and spatial-analysis queries on data sets stored in the distributed data warehouse system.
  ◦ Providing spatial functions for spatial queries
  ◦ Supporting spatial data types
  ◦ Supporting an index structure for spatial data sets
  ◦ Supporting raster data

[Figure: Overall architecture of Apache Tajo. Labels: Client (JDBC, SQL shell, Web UI); Tajo Master (CatalogStore; allocates queries, manages metadata); Tajo Worker (QueryMaster, Local Query Engine, StorageManager); storage (local file system, HDFS, Amazon S3); Spatial Tajo.]

Motive for development
• The spatial data sets to be analyzed approach big-data volumes, and I'd like to analyze them using SQL.
• I'd like to use a system that works without batch processing.
• I'd like to use free software or a free solution (provided that I contribute my experience back to the communities).

Motive for development (continued)
• Of course, there are good options among the existing software and solutions. But…
  ◦ Relational databases and DBMSs: Oracle Spatial and Graph (Oracle Database + plug-in); MySQL; PostGIS with PostgreSQL
  ◦ NoSQL: document-oriented databases (MongoDB, CouchDB (plug-in), RethinkDB); HBase, Hive
  ◦ Cluster and cloud: GeoMesa, ESRI GIS Tools for Hadoop, SpatialHadoop (on Hadoop); CartoDB (SaaS on top of PostgreSQL and PostGIS)
• Conclusion: FOSS + Spatial → Apache Tajo + Spatial Plug-in!

Why I chose Apache Tajo?
• Apache Tajo
  ◦ A robust big data relational and distributed data warehouse system for Apache Hadoop
  ◦ Designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large data sets stored on HDFS and other data sources.
• Why did I choose Apache Tajo? Features:
  ◦ Distributing and storing the data is delegated to Hadoop or Amazon S3
  ◦ It supports inserting data, but not updating it
  ◦ Compatibility: queries can be written in ANSI/ISO SQL
  ◦ It is faster than MapReduce-based processing and provides fault tolerance
  ◦ It is easy to set up and manage
  ◦ Users can implement and plug in their own extensions if necessary

Plan for the implementation of the plug-in
• Spatial functions for spatial queries
  ◦ Distance, Equals, Disjoint, Intersects, Touches, Crosses, Overlaps, Contains, Length, Area, and Centroid
  ◦ Conversion functions for spatial types (such as from_OOOO or to_OOOO)
• Adding spatial data types
• Enabling kNN (k-nearest-neighbor) queries
• Supporting an index for spatial data
  ◦ R-tree, Quad-tree, KD-tree, etc.
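To make the kNN item concrete, here is a minimal sketch of a k-nearest-neighbor query built on nothing but a distance function. It is plain Python, not Spatial Tajo code: the names `distance`, `knn`, and `stations` are hypothetical, and the points are assumed to be planar (x, y) pairs.

```python
import heapq
import math


def distance(a, b):
    """Euclidean distance between two planar (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def knn(points, query, k):
    """Return the k points nearest to `query`, closest first.

    This is a full scan: every point is compared with the query,
    which is what a kNN query costs when no spatial index is used.
    """
    return heapq.nsmallest(k, points, key=lambda p: distance(p, query))


# Hypothetical sample data, for illustration only.
stations = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (10.0, 10.0)]
nearest = knn(stations, (0.0, 0.0), 2)
```

In SQL terms this corresponds to ordering rows by their distance to the query point and keeping the first k. A spatial index would let the engine skip most of the comparisons.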

Current status – Parts implemented
• Most primary spatial functions
  ◦ Implemented using JTS
  ◦ Distance, Equals, Disjoint, Intersects, Touches, Crosses, Overlaps, and Contains
• Running kNN queries
  ◦ kNN queries can be run using the implemented spatial functions.
• Indexing for spatial data
  ◦ R-tree indexing using Sort-Tile-Recursive (STR)

[Figure: global index and local indexes]

The process of reading data using the index:
1. Read the global index and find the search keys.
2. Find the local indexes corresponding to the search keys.
3. Find the search keys in the local indexes.
4. Read the tuples directly.

Sources: Wikipedia (R-tree, STR), the SpatialHadoop paper, and the Tajo documentation
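The STR packing and the four-step read process above can be sketched as follows. This is a simplified illustration in plain Python, not the Spatial Tajo implementation. It assumes point data, performs a single STR packing pass (a full R-tree would apply the same packing again to the resulting rectangles), and lets each packed leaf stand in for a local index. All function names are hypothetical.

```python
import math


def mbr(points):
    """Minimum bounding rectangle (min_x, min_y, max_x, max_y) of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def str_pack(points, capacity):
    """Group points into leaves of at most `capacity` points (one STR pass)."""
    leaf_count = math.ceil(len(points) / capacity)
    slice_count = math.ceil(math.sqrt(leaf_count))
    slice_size = slice_count * capacity
    by_x = sorted(points, key=lambda p: p[0])      # sort by x, cut into vertical slices
    leaves = []
    for i in range(0, len(by_x), slice_size):
        by_y = sorted(by_x[i:i + slice_size], key=lambda p: p[1])  # sort each slice by y
        for j in range(0, len(by_y), capacity):
            leaves.append(by_y[j:j + capacity])    # tile the slice into leaves
    return leaves


def intersects(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_index(points, capacity):
    """Global index: one (MBR, leaf) entry per packed leaf."""
    return [(mbr(leaf), leaf) for leaf in str_pack(points, capacity)]


def range_query(index, window):
    """Return all points inside `window` = (min_x, min_y, max_x, max_y)."""
    hits = []
    for box, leaf in index:                        # steps 1-2: scan the global index
        if intersects(box, window):
            for p in leaf:                         # steps 3-4: search the leaf, read tuples
                if window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]:
                    hits.append(p)
    return hits


# Hypothetical 4 x 4 grid of points, packed four per leaf.
points = [(x, y) for x in range(4) for y in range(4)]
index = build_index(points, capacity=4)
result = range_query(index, (0, 0, 1, 1))
```

Only the leaves whose bounding rectangle intersects the query window are opened, which is the point of the global/local split: most of the data is never read.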

Current status – Parts not yet implemented
• Adding spatial data types
  ◦ Without them, the parameters of spatial functions are imprecise and inconvenient to use.
• Spatial functions not yet implemented
  ◦ Length, Area, and Centroid
  ◦ Conversion functions (e.g. from_OOO and to_OOO)
• Optimizing functions and queries
  ◦ Optimizing spatial functions
  ◦ Optimizing kNN queries
• Indexing spatial data
  ◦ Quad-tree (with GeoHash) and KD-tree
• Modularization
  ◦ Because the code is not yet separated from Apache Tajo, it cannot currently be installed as a plug-in.
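For readers unfamiliar with GeoHash, the sketch below shows the standard encoding. It is included only to illustrate why GeoHash pairs naturally with a quad-tree (each pair of bits halves the longitude range and then the latitude range, which selects one quadrant). It is not code from Spatial Tajo, and the function name `geohash` is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)


def geohash(lat, lon, precision):
    """Encode a latitude/longitude pair as a GeoHash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid       # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid       # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:     # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


code = geohash(57.64911, 10.40744, 3)
```

Points that fall in the same cell share the same prefix, so a prefix can serve as a key for a quad-tree-like partition of the data.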

Conclusion
• What is Spatial Tajo?
• Motive for Development
• Why I chose Apache Tajo?
• Plan for the implementation of the plug-in
• Current status
  ◦ Parts implemented
  ◦ Parts not yet implemented

References
• Apache Tajo
  ◦ Official website, source code, user documentation
  ◦ Efficient In-situ processing of various storage types on Apache Tajo
  ◦ Apache Tajo Enters the SQL-on-Hadoop Space
  ◦ SQL-on-Hadoop: What does “100 times faster than Hive” actually mean?
  ◦ Setting up an Apache Tajo Cluster on Amazon EMR
• PostgreSQL and PostGIS documentation
• SpatialHadoop
  ◦ Official website, source code
  ◦ A demonstration of SpatialHadoop: an efficient MapReduce framework for spatial data
  ◦ SpatialHadoop: towards flexible and scalable spatial processing using MapReduce

References (continued)
• Spatial Databases: With Application to GIS
• Indexing
  ◦ STR: A simple and efficient algorithm for R-tree packing
  ◦ SpatialHadoop: towards flexible and scalable spatial processing using MapReduce
  ◦ Tajo: A distributed data warehouse system on large clusters
  ◦ Apache Tajo Documents: Index types
• Wikipedia
  ◦ Spatial Database
  ◦ Spatial Query
  ◦ R-tree

Thank you for listening
Do you have any questions? Please contact me with your questions, and I'll answer them on GitHub: