
1 Distributed Indexing of Web Scale Datasets for the Cloud Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos Email: {ikons, eangelou, dtsouma}@cslab.ece.ntua.gr Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens

2 Problem Increasing data volumes (e-mail and web logs, historical data, click streams) push classic RDBMSs to their limits. Centralized indices are slow to create and can’t scale to a large number of concurrent requests. Current MapReduce-based data analysis tools (e.g. Hive, Pig) do not provide near-real-time access to user queries.

3 NoSQL Systems NoSQL databases: horizontally scalable, distributed, non-relational data stores. Simple but fast queries. Relaxed ACID guarantees. Examples: Google's Bigtable, Amazon's Dynamo, Facebook's Cassandra and LinkedIn's Voldemort. Perfect candidates for cloud infrastructures: their shared-nothing architecture enables elastic scalability.

4 Our contribution A distributed processing framework to index, store and serve large amounts of content data under heavy request loads. Users provide the raw data along with simple indexing rules. NoSQL and MapReduce combination: MapReduce jobs process the input to create the index; the index and content are served through a NoSQL system.

5 Goals Support for almost any type of data – Unstructured, semi-structured and fully structured. Near real-time query response times – Query execution times should be on the order of milliseconds. Scalability (preferably elastic) – Both in terms of storage space and concurrent user requests. Ease of use – Simple index rules.

6 Architecture Raw content is uploaded to HDFS. The content, along with index rules, is fed to the Uploader to create the Content table. The Content table is fed to the Indexer, which extracts the Index table. The client API contacts the Index table to perform searches and the Content table to serve objects.

7 Architecture - Index rules Instructions on what to index. Specify record boundaries to split the input into distinct entities. Select the content regions to index (granularity).

8 Uploader class Crunches the input data to create the Content table using MapReduce. Mappers read input records and create HBase rows (one per record). Reducers sort rows and columns and write them back to HDFS as HFiles. The HBase API is bypassed for speed reasons. Uses an MD5-hash total order partitioner.
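A minimal sketch of what such an Uploader mapper could look like, assuming an HBase 0.9x-style API, a column family named "content" and a made-up granularity column id (none of these details are given in the slides):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative Uploader mapper: each input record becomes one Content-table row
// keyed by the MD5 hash of its content; the emitted Puts are later sorted by the
// reducers and written out as HFiles (bypassing the regular HBase write path).
public class UploaderMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("content"); // assumed column family

  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    byte[] raw = Bytes.toBytes(record.toString());
    byte[] rowKey = Bytes.toBytes(MD5Hash.getMD5AsHex(raw)); // MD5 row key

    Put put = new Put(rowKey);
    // "revision_0" is a made-up granularity/increment column id for illustration.
    put.add(FAMILY, Bytes.toBytes("revision_0"), raw);
    ctx.write(new ImmutableBytesWritable(rowKey), put);
  }
}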

9 Content table Row key: MD5 hash of the record content. Column ids: granularity levels with an increment number. A row key and column id specify an HBase cell. Cell values contain the content to be indexed. A specific cell per record contains the content that will not be indexed.
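For example (made-up values; "raw_0" is only a placeholder for the non-indexed cell, whose actual column id the slides do not give), a single stored record could occupy a row such as:

Row key (MD5 hash)   Column id     Cell value
8f14e45f…            title_0       MapReduce
8f14e45f…            revision_1    revision text mentioning google
8f14e45f…            raw_0         content stored but not indexed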

10 Indexer class Creates an inverted list of index terms and term locations (the Index table). Mapper input is the Content table and output is HBase cells of the Index table. Reducers sort the HBase cells (input) and create the appropriate HFiles (output). Uses a SimpleTotalOrderPartitioner.
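A rough sketch of this step, again with assumed tokenization, family name and 0.9x-style HBase APIs rather than the framework's actual code:

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative Indexer mapper: reads Content-table rows and emits inverted-index
// cells. Output row key: "<term>_<granularity>"; output column id: the content
// row key plus the increment number, pointing back to the exact content cell.
public class IndexerMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("index"); // assumed column family

  @Override
  protected void map(ImmutableBytesWritable contentRow, Result row, Context ctx)
      throws IOException, InterruptedException {
    for (KeyValue kv : row.raw()) {
      String columnId = Bytes.toString(kv.getQualifier());           // e.g. "revision_3"
      String granularity = columnId.substring(0, columnId.lastIndexOf('_'));
      String increment = columnId.substring(columnId.lastIndexOf('_') + 1);
      String pointer = Bytes.toString(contentRow.get(), contentRow.getOffset(),
          contentRow.getLength()) + "_" + increment;                  // MD5 + increment

      for (String term : Bytes.toString(kv.getValue()).toLowerCase().split("\\W+")) {
        if (term.isEmpty()) continue;
        byte[] indexRowKey = Bytes.toBytes(term + "_" + granularity); // e.g. "google_revision"
        Put put = new Put(indexRowKey);
        put.add(FAMILY, Bytes.toBytes(pointer), new byte[0]);         // empty cell value
        ctx.write(new ImmutableBytesWritable(indexRowKey), put);
      }
    }
  }
}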

11 Index table Row key: index term followed by the granularity (e.g. the key is google_revision if “google” was found in a revision tag). Column ids: the Content-table row key (MD5 hash) along with the granularity increment number, pointing to the exact cell in the content record.
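Continuing the made-up example above: if “google” occurs in cell revision_1 of the content row 8f14e45f…, the Index table would hold roughly:

Row key            Column id (content row key + increment)
google_revision    8f14e45f…_1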

12 Client API Search for a keyword using the Index table – select a level of granularity (HBase Get), or search across all levels (HBase Scan). Retrieve the object from the Content table – using a simple HBase Get operation.
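A possible client-side lookup written against the plain HBase client API; the table names, family names and key layout follow the assumptions of the earlier sketches, not a documented interface of the framework:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative search: look up the term "google" at the "revision" granularity
// in the Index table, then fetch each matching record from the Content table.
public class SearchExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable index = new HTable(conf, "index");
    HTable content = new HTable(conf, "content");

    // Index-table Get: row key is "<term>_<granularity>".
    Result hits = index.get(new Get(Bytes.toBytes("google_revision")));
    for (KeyValue kv : hits.raw()) {
      // The column id holds the Content-table row key (MD5) plus the cell increment.
      String pointer = Bytes.toString(kv.getQualifier());
      String contentRowKey = pointer.substring(0, pointer.lastIndexOf('_'));
      Result record = content.get(new Get(Bytes.toBytes(contentRowKey)));
      System.out.println(contentRowKey + " -> " + record);
    }
    index.close();
    content.close();
  }
}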

13 XML indexer Users provide – A specific tag name used to split records. – A comma-separated list of tag names to index.
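For a Wikipedia XML dump, for instance, the user input could look roughly like this (illustrative values and syntax only, not the framework's exact format):

record boundary tag:  page
tags to index:        title, contributor, revision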

14 MySQL indexer Assumptions: the dataset comes in two “splits” – Database description (e.g. obtained from mysqldump using the --no-data option) – Full data in a single row dump. Retains the original information from the MySQL schema to allow similar queries. Follows indexing conventions similar to the XML indexer, allowing searching with the same queries without a priori knowledge of the source of the data.
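One possible way to produce the two splits with standard mysqldump options (an assumed invocation; the database name is made up):

mysqldump --no-data wikidb > schema.sql         # database description only
mysqldump --no-create-info wikidb > data.sql    # data-only dump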

15 Experiments Indexing speed. Time vs. dataset size: how well does it scale for big data? Currently testing with 23 GB of Wikipedia data – planning a 1.7 TB test with the full Wikipedia dump. Intermediate sizes can be achieved by manually splitting the datasets.

16 Experiments (2) Time vs. number of nodes: what is the speedup achieved by adding more nodes? Current tests were performed on two Hadoop clusters – clones with 10 nodes and xenons with 5 nodes. More nodes are needed… Time vs. HBase setup: the experiments were performed on “vanilla” installations of HBase. Can speed benefits be achieved by tailoring the installation to our needs?

17 Future developments Index aggregation: indexing multiple HBase tables, or even tables from different HBase masters, thus creating a central repository of similar (and dissimilar) information for an organization. Technical challenge: creating an algorithm that can find similarities between the tables, thus allowing “smarter” queries.

18 Questions

