Experimenting Lucene Index on HBase in an HPC Environment
Xiaoming Gao, Vaibhav Nachankar, Judy Qiu

Presentation transcript:


Outline
- Introduction
- System design and implementation
- Preliminary index data analysis
- Comparison with related work
- Future work

Introduction
- Background: data-intensive computing requires storage solutions for huge amounts of data
- One proposed solution: HBase, the Hadoop implementation of Google’s BigTable

Introduction
- HBase architecture: tables are split into regions and served by region servers
- Reliable data storage and efficient access to TBs or PBs of data; successfully applied at Facebook and Twitter
- Problem: no inherent mechanism for searching field values, especially full-text values

Introduction
- Inverted index:
  - <term> -> <doc id>, <doc id>, …
  - “computing” -> doc1, doc3, …
- Apache Lucene:
  - Inverted index library for full-text search
  - Incremental indexing, document scoring, multi-index search with merged results, etc.
  - Existing Lucene-based indexing systems store index data in files, which is not a natural integration with HBase
- Solution: integrate and maintain inverted indices directly in HBase
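The inverted index idea on this slide can be illustrated with a minimal Python sketch. This is not Lucene and not the authors' code: the function name `build_inverted_index`, the simple lowercase tokenizer, and the three toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
import re


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: runs of ASCII letters and digits, lowercased.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Toy documents, invented for this example.
docs = {
    "doc1": "Data intensive computing",
    "doc2": "Storage solutions for data",
    "doc3": "Cloud computing",
}
index = build_inverted_index(docs)
print(index["computing"])  # ['doc1', 'doc3']
```

Looking up a term is then a single dictionary access, which is the property a full-text search layer relies on.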

System design
- Data from a real digital library application
  - Bibliography data, page image data, text data
  - Requirements: answer users’ queries for books, and fetch book pages for users
- Query format:
  - {<field name>: term1, term2, ...; <field name>: term1, term2, ...; ...}
  - Example: {title: "computer"; authors: "Radiohead"; text: "Let down"}
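One way to read the query format is as a list of field clauses separated by semicolons, each holding comma-separated terms. The parser below is a hedged sketch under that reading; `parse_query` is a hypothetical helper, and it assumes terms never contain `;`, `,`, or `:`.

```python
def parse_query(query):
    """Parse '{field: term1, term2; field: term1}' into {field: [terms]}."""
    body = query.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    parsed = {}
    for clause in body.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        # Split the field name from its comma-separated term list.
        field, _, terms = clause.partition(":")
        parsed[field.strip()] = [
            t.strip().strip('"') for t in terms.split(",") if t.strip()
        ]
    return parsed


query = '{title: "computer"; authors: "Radiohead"; text: "Let down"}'
parsed = parse_query(query)
print(parsed)
```

Each field in the result could then be matched against the index rows for that field.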

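System design
[Figure: system architecture. A client interacts with HBase, which holds the book bibliography table, the book text data table, the book image data table, and the Lucene index tables; numbered steps ① to ⑥ mark the interactions.]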

System design
Table schemas:
- Book bibliography table: <row key> --> {md:[title, category, authors, createdYear, publishers, location, startPage, currentPage, ISBN, additional, dirPath, keywords]}
- Book text data table: <row key> --> {pages:[1, 2, ...]}
- Book image data table: <row key> --> {image:[image]}
- Lucene index tables:
  - <term> --> {frequencies:[<book id>, <book id>, ...]}
  - <term> --> {positions:[<book id>, <book id>, ...]}

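System design
- Index table schema for storing term frequencies (column family “frequencies”, one column per book id):
  row key    | <book id> | <book id> | … (other book ids)
  “database” | 3         | 4         | …
- Index table schema for storing term position vectors (column family “positions”, one column per book id):
  row key    | <book id>  | <book id>      | … (other book ids)
  “database” | 1, 24, 33  | 1, 34, 77, 221 | …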
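The two index table layouts can be mimicked with nested dictionaries, where the outer key plays the role of the row key (the term) and the inner key plays the role of the column (the book id). This is only a sketch of the data shape, not HBase client code; `build_index_rows`, the tokenizer, the 1-based positions, and the toy books are assumptions.

```python
from collections import defaultdict
import re


def build_index_rows(books):
    """Return (frequencies, positions), each shaped as {term: {book_id: value}}."""
    positions = defaultdict(dict)
    for book_id, text in books.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Assumption: positions are counted from 1 within each book.
        for pos, term in enumerate(tokens, start=1):
            positions[term].setdefault(book_id, []).append(pos)
    # A term's frequency in a book is the length of its position vector.
    frequencies = {
        term: {book_id: len(vec) for book_id, vec in cols.items()}
        for term, cols in positions.items()
    }
    return frequencies, dict(positions)


# Toy books, invented for this example.
books = {
    "book1": "database systems and database design",
    "book2": "a distributed database",
}
freq, pos = build_index_rows(books)
print(freq["database"])  # {'book1': 2, 'book2': 1}
print(pos["database"])   # {'book1': [1, 4], 'book2': [3]}
```

Keeping one row per term means that all postings for a term are read with a single row lookup.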

System design
Benefits of the system architecture:
- Natural integration with HBase
- Reliable and scalable index data storage
- Distributed workload for index data access
- Real-time document addition and deletion
- MapReduce programs for building the index and for index data analysis
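The last benefit, building the index with MapReduce, can be sketched as a map step that emits one record per token and a reduce step that sums counts per book for each term. The code below simulates the phases in a single Python process; it is not Hadoop code, and `map_book`, `reduce_term`, and `run_job` are hypothetical names.

```python
from collections import defaultdict


def map_book(book_id, text):
    """Map phase: emit (term, (book_id, 1)) for every token in one book."""
    for term in text.lower().split():
        yield term, (book_id, 1)


def reduce_term(term, values):
    """Reduce phase: sum the counts per book to form one frequency row."""
    row = defaultdict(int)
    for book_id, count in values:
        row[book_id] += count
    return term, dict(row)


def run_job(books):
    # Shuffle: group the mapper output by term before reducing.
    grouped = defaultdict(list)
    for book_id, text in books.items():
        for term, value in map_book(book_id, text):
            grouped[term].append(value)
    return dict(reduce_term(term, values) for term, values in grouped.items())


# Toy books, invented for this example.
books = {
    "book1": "hbase stores index data",
    "book2": "index data in hbase tables index",
}
rows = run_job(books)
print(rows["index"])  # {'book1': 1, 'book2': 2}
```

Because each reduce call produces exactly one row keyed by term, its output maps directly onto the frequency index table layout.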

System implementation
- Experiments completed on the Alamo HPC cluster of FutureGrid
- MyHadoop -> MyHBase

System implementation Workflow:

Preliminary index data analysis
- Number of books indexed: 2294
- Number of distinct terms: [value missing]
- [count missing] terms (73%) appear in only 1 book
- “1” appears in 1904 books

Preliminary index data analysis
- [count missing] terms (63%) appear only once in all books
- “we” appears [count missing] times in the whole data set
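Statistics of this kind can be derived directly from the frequency index rows. The sketch below shows the arithmetic on invented toy data; `index_statistics` is a hypothetical helper, and the numbers it prints say nothing about the real data set.

```python
def index_statistics(frequencies):
    """Summarize a frequency index shaped as {term: {book_id: count}}."""
    total = len(frequencies)
    # Terms whose row has exactly one book column.
    single_book = sum(1 for cols in frequencies.values() if len(cols) == 1)
    # Terms whose counts sum to one across all books.
    single_occurrence = sum(
        1 for cols in frequencies.values() if sum(cols.values()) == 1
    )
    return {
        "distinct_terms": total,
        "single_book_share": single_book / total,
        "single_occurrence_share": single_occurrence / total,
    }


# Toy frequency rows, invented for this example.
frequencies = {
    "database": {"book1": 3, "book2": 4},
    "hbase": {"book1": 2},
    "w9": {"book2": 1},
    "we": {"book1": 5, "book2": 7},
}
stats = index_statistics(frequencies)
print(stats)
```

On a real index the same counts would more plausibly come from a scan or a MapReduce job over the index table than from an in-memory dictionary.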

Preliminary index data analysis
- 94% of all terms have a record size of <= 500 bytes in the frequency index table
- Largest record size: 85KB for “from”
- Smallest record size: 48 bytes for “w9”

Comparison with related work
- Pig and Hive:
  - Pig Latin and HiveQL have operators for search, but these are not based on indices
  - Suitable for batch analysis of large data sets
- SolrCloud, ElasticSearch, Katta:
  - Distributed search systems based on Lucene indices
  - Indices organized as files; not a natural integration with HBase
  - Each has its own system management mechanisms
- Solandra:
  - Inverted index implemented as tables in Cassandra
  - Different index table designs; no MapReduce support

Future work
- Distributed performance evaluation
- Distributed search engine integrated with HBase region servers
- More data analysis or text mining based on the index support

Thanks! Questions?