Geoffrey Hendrey: Architecture for real-time ad-hoc query on distributed filesystems

1 Geoffrey Hendrey @geoffhendrey Architecture for real-time ad-hoc query on distributed filesystems

2 Motivation
Big Data is more opaque than small data
– Spreadsheets choke
– BI tools can’t scale
– Small samples often fail to replicate issues
Engineers, data scientists, analysts need:
– Faster “time to answer” on Big Data
– Rapid “find, quantify, extract”
Solve “I don’t know what I don’t know”
This is NOT about looking up items in a product catalog (i.e., not a consumer search problem)

3 Scaling search with classic sharding

4 Classic “side system” approach
Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” – Merriam-Webster
[Diagram labels: Hadoop, Search Cluster, Search Cluster, ?????]

5 Classic “search toolkit”
Built around the fulltext use case
Inverted indexes optimized for on-the-fly ranking of results
– TF-IDF
– Okapi BM25
Yet never able to fully realize Google-style search capability
Issues:
– Phrase detection
– Pseudo synonymy
– Open loop architecture
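To make the relevance-ranking idea concrete, here is a minimal sketch of Okapi BM25 scoring over tokenized documents. The parameter values `k1=1.5` and `b=0.75` are commonly used defaults, not values from the talk, and the three-document corpus is invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Invented corpus, for illustration only.
docs = [
    "error connecting to athens datacenter".split(),
    "athens athens athens".split(),
    "request served from london".split(),
]
scores = bm25_scores(["athens"], docs)
```

The score is computed at query time from term statistics, which is the “on-the-fly ranking” that the inverted indexes in these toolkits are optimized for.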

6 Big data ad-hoc query
Not typically a fulltext “document search” problem
Data is structured, mixed structured, and denormalized
– Log lines
– JSON records
– CSV files
– Hadoop native formats (SequenceFile)
Ranking is explicit (ORDER BY), not relevance based
Sometimes “needle in haystack” (support, debugging)
Sometimes “haystack in haystack” (summary analytics, segmentation)
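The two query shapes can be illustrated with a small sketch over JSON log records. The records, field names, and the `query` helper are hypothetical; the point is only that ordering comes from an explicit sort key rather than a relevance score.

```python
import json

# Hypothetical JSON log records, one per line, as they might sit on a DFS.
raw_lines = [
    '{"ts": 3, "city": "athens", "latency_ms": 120}',
    '{"ts": 1, "city": "london", "latency_ms": 45}',
    '{"ts": 2, "city": "athens", "latency_ms": 300}',
]

def query(lines, predicate, order_by, descending=False):
    """Filter parsed records, then sort on an explicit key (like ORDER BY)."""
    records = (json.loads(line) for line in lines)
    matches = [r for r in records if predicate(r)]
    return sorted(matches, key=lambda r: r[order_by], reverse=descending)

# "Needle in haystack": the slowest athens requests first.
needles = query(
    raw_lines,
    lambda r: r["city"] == "athens",
    order_by="latency_ms",
    descending=True,
)

# "Haystack in haystack": a summary count per city.
counts = {}
for line in raw_lines:
    city = json.loads(line)["city"]
    counts[city] = counts.get(city, 0) + 1
```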

7 Dremel MPP query execution tree

8 Finer points of Dremel architecture
MapReduce friendly
In-situ approach is DFS friendly
Excels at aggregation. Not so much for needle-in-haystack.
Column storage format accelerates MapReduce (less extraneous data pushed through)
But in some regards still a “side system”
Applications must explicitly store their data in a columnar format
“Massive” is both a benefit and a hazard
– Complex (operationally and with respect to query execution)
– Queries can execute quickly…on huge clusters
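A toy sketch of why a columnar layout helps aggregation: once records are pivoted into per-field lists, an aggregate reads one column and skips everything else. This is not Dremel's actual storage format, only an illustration of the principle; the records and field names are hypothetical.

```python
# Hypothetical records; field names are illustrative only.
rows = [
    {"city": "athens", "latency_ms": 120, "payload": "x" * 50},
    {"city": "london", "latency_ms": 45, "payload": "y" * 50},
    {"city": "athens", "latency_ms": 300, "payload": "z" * 50},
]

def to_columns(records):
    """Pivot row-oriented records into one list per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = to_columns(rows)

# The aggregate reads only the latency_ms column and never touches payload.
mean_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])
```

The pivot step is also the cost the slide points at: the application has to store its data in this shape before it sees any benefit.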

9 Crawled In-Situ Index Architecture
[Diagram labels: HDFS, MapReduce, Data, Crawl, In-situ Index, SimpleSearch, Application, Hadoop]

10 Benefits to crawled In-Situ index
No changes to application data format
– CSV
– JSON
– SequenceFile
Clear “separation of concerns” between data and index
Indexes become “disposable”: easily built, easily thrown away
There is no “side system” that needs to be maintained
Use the MapReduce “hammer” to pound a nail

11 Architect for Elasticity
[Diagram labels: AWS S3, Elastic MapReduce, JetS3t, EC2 M1.large, EC2 M1.large, Application, Crawl, Index, HTTP]
Interesting: you don’t actually need to have Hadoop installed…

12 Declarative Crawl Indexing
[Diagram labels: HDFS, MapReduce, Data, Crawl, In-situ Index, SimpleSearch, Application, Hadoop]
Parse.json: { "filter": "column[4]==\"athens\"" }
Indexer reads declarative instructions from an in-situ file
“Pull” vs. traditional “push” indexing approach
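One way a crawler could honor such a declarative instruction is sketched below. The talk does not specify the filter grammar, so this sketch assumes only the single form shown on the slide, `column[N]=="value"`; the `compile_filter` and `crawl` helpers and the sample CSV are hypothetical.

```python
import csv
import io
import json
import re

# Assumption: only the form shown on the slide, column[N]=="value", is supported.
FILTER_RE = re.compile(r'^column\[(\d+)\]\s*==\s*"([^"]*)"$')

def compile_filter(instructions_json):
    """Turn the declarative instruction into a predicate over a parsed row."""
    expr = json.loads(instructions_json)["filter"]
    match = FILTER_RE.match(expr)
    if match is None:
        raise ValueError(f"unsupported filter expression: {expr}")
    index, expected = int(match.group(1)), match.group(2)
    return lambda row: len(row) > index and row[index] == expected

def crawl(data_text, instructions_json):
    """Pull-style crawl: scan the data where it lives, return matching row numbers."""
    keep = compile_filter(instructions_json)
    reader = csv.reader(io.StringIO(data_text))
    return [row_number for row_number, row in enumerate(reader) if keep(row)]

# Contents of a Parse.json file as it would sit next to the data.
parse_json = '{"filter": "column[4]==\\"athens\\""}'
data = "1,a,b,c,athens\n2,a,b,c,london\n3,a,b,c,athens\n"
matching_rows = crawl(data, parse_json)
```

Because the instructions live in a file beside the data, the application never calls an indexing API; the crawler discovers what to index, which is the “pull” side of the contrast.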

13 Thin index
Index size is small because the data is a holistic part of the system: data does not need to be “put into” the search system and replicated in the index.
[Diagram labels: HDFS, MapReduce, Data, Crawl, In-situ Index, Data, Index]
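A thin index can be sketched as a map from column values to byte offsets in the original file, so that hits are resolved by seeking into the data rather than reading a copy stored in the index. This is an illustrative assumption about the layout, not the system's actual index format; a local temporary file stands in for a file on the distributed filesystem.

```python
import os
import tempfile
from collections import defaultdict

def build_thin_index(path, column, delimiter=","):
    """Map each value of one column to the byte offsets of the lines holding it."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            fields = line.decode("utf-8").rstrip("\r\n").split(delimiter)
            if len(fields) > column:
                index[fields[column]].append(offset)
    return dict(index)

def fetch(path, offsets):
    """Resolve index hits by seeking into the original data file."""
    rows = []
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            rows.append(f.readline().decode("utf-8").rstrip("\r\n"))
    return rows

# Stand-in for a file on the distributed filesystem.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("1,athens\n2,london\n3,athens\n")
    data_path = tmp.name

index = build_thin_index(data_path, column=1)
hits = fetch(data_path, index["athens"])
os.remove(data_path)
```

The index holds only values and integers, so it stays small and can be rebuilt from the data at any time.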

14 Lazy data loading
[Diagram labels: HDFS, MapReduce, Data, Crawl, Execution Runtime, Execution Runtime, Data, Index, LRU Index Cache, LRU Index Cache, Lazy Pull]
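The combination of lazy pull and an LRU index cache can be sketched as follows. The `LazyIndexCache` class and `load_segment` function are hypothetical names; in a real deployment the loader would read an index segment from HDFS or S3.

```python
from collections import OrderedDict

class LazyIndexCache:
    """LRU cache that pulls an index segment only when a query first needs it."""

    def __init__(self, loader, capacity):
        self._loader = loader            # called only on a cache miss
        self._capacity = capacity
        self._segments = OrderedDict()   # oldest entry first
        self.loads = 0

    def get(self, segment_id):
        if segment_id in self._segments:
            self._segments.move_to_end(segment_id)  # mark as most recently used
            return self._segments[segment_id]
        segment = self._loader(segment_id)          # lazy pull
        self.loads += 1
        self._segments[segment_id] = segment
        if len(self._segments) > self._capacity:
            self._segments.popitem(last=False)      # evict least recently used
        return segment

    def cached_ids(self):
        return list(self._segments)

def load_segment(segment_id):
    # Hypothetical stand-in for reading an index segment from HDFS or S3.
    return {"segment": segment_id}

cache = LazyIndexCache(load_segment, capacity=2)
cache.get("a")
cache.get("b")
cache.get("a")   # hit: no load, "a" becomes most recently used
cache.get("c")   # miss: loads "c" and evicts "b"
```

Nothing is loaded at startup, so an execution runtime can come up immediately and keep in memory only the index segments that recent queries touched.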

15 Column Oriented Approach

16 Contact Info Email: Private Beta
