Search and Query: An {Over, Re}view Joseph M. Hellerstein Computer Science Division UC Berkeley

Background
My interests: Database & Information Systems
- Tech: indexing, query processing, dataflow, parallel/distributed/federated systems
- Apps: interactive data analysis, eMarkets, eCatalogs, structured querying on the Internet, querying sensors
My (lack of) agenda for today
- Examples to defamiliarize today's modes of operation
- Technical breakdown/review of system components: core/common vs. value-added
- Natural policy discussion
  - Core = "free" and "fair"? Value-added = "expensive" and "market-driven"?
  - Conflicting needs of core scaling and policy?

A Note of Caution
Contemplating a redesign? Fine. But…
- Familiarity often trumps other desiderata
- Memories are short, cross-fertilization is poor, even in the technical community
- Trap: common culture/techniques
  - "What's hot": everybody knows web searching
  - "What's packaged": database technology = Oracle
  - Rather than "what's core technology"
In expanding DNS, be sure to review known technology
- What's core under the packaging of related technologies?
- Reuse/generalize common ideas
- Leave specialized ideas above the core (minimize core constraints)
Good way to evaluate "brand new" ideas, too
- How do they challenge the core? Is the core "right"?

Beware of Today's Common Culture
Rise of the web => rise of full-text search
- Used to be a library app (yawn)
- Suddenly the dominant paradigm (??)
Beware: Internet ≠ Web!
- DNS is not document-centric
- Neither are peer-to-peer filesharing or "Deep Web" facts and figures
- Keyword search may not be core functionality
If DNS is to serve many needs…
- Must separate core search services from varying apps!
- Try to accommodate what's hot tomorrow
- What's a core service? What's an app?

Road Map
- Defamiliarization: some new Internet "search" apps, provided by Telegraph
- An {over, re}view: some basics of querying, some separation of the issues
- Discussion…

Peer-to-Peer Filesharing
Napster, Gnutella, Aimster, etc., and a pile of ill-defined startups
This is not the web, but it is part of the Internet
Lessons already!
- Distributed files and search by substring: isn't that heaven?!
- Would like to handle spelling, language translation (pig latin), etc.
  - A layer of canonicalization ("nameprep")
- Well, what about these translations…
  - I post my name mapping scheme on my website
  - Download the Billboard Top Ten
  - Download tunes I've paid for
  - No problem…
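A minimal sketch of the canonicalization idea, under my own assumptions (the mapping entries and filenames below are invented): normalize names before substring matching so that spelling variants and simple translations still find the same files.

```python
# Sketch of a "nameprep"-style canonicalization layer in front of substring
# search over shared filenames. Mapping table and filenames are hypothetical.
import unicodedata

MAPPINGS = {                 # hypothetical user-posted name mappings
    "ig-pay atin-lay": "pig latin",
    "bjoerk": "bjork",
}

def canonicalize(name: str) -> str:
    """Lowercase, strip accents, collapse whitespace, then apply mappings."""
    s = unicodedata.normalize("NFKD", name)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = " ".join(s.lower().split())
    return MAPPINGS.get(s, s)

def substring_search(query: str, shared_files: list[str]) -> list[str]:
    q = canonicalize(query)
    return [f for f in shared_files if q in canonicalize(f)]

print(substring_search("Bjoerk",
                       ["Björk - Army of Me.mp3", "Beatles - Dear Prudence.mp3"]))
```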

TeleNap: Amazon Meets Napster
OpenNap servers + album information

TeleNap: Amazon Meets Napster

"Search" vs. Query
"Search" can return only what's been stored
E.g., best match at iWon, Google, AskJeeves top ten:

But the Basic Facts are There!

Query >> Search: "Federated Facts and Figures"
Yahoo join FECInfo
Imagine DNS analogies:
- Find IP addresses of hosts registered by companies that donated to GW Bush (FECInfo join WHOIS)
- Note: errors! Can do fuzzy matching and ranking, too
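As a hedged illustration of what such a federated join might look like (the tables, names, and similarity threshold below are invented, not Telegraph's code): join two sources on approximately matching organization names, then rank the matches so errors can be spotted.

```python
# Fuzzy join across two "federated" sources, in the spirit of FECInfo join WHOIS.
from difflib import SequenceMatcher

fec_donors = [{"org": "Acme Widgets Inc.", "amount": 5000}]
whois = [{"registrant": "ACME WIDGETS INCORPORATED",
          "host": "acme.example.com", "ip": "192.0.2.10"}]

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pair each donor with registrants whose names are "close enough",
# ranked by similarity score (highest first).
matches = sorted(
    ((similar(d["org"], w["registrant"]), d["org"], w["host"], w["ip"])
     for d in fec_donors for w in whois),
    reverse=True)

for score, org, host, ip in matches:
    if score > 0.5:                      # arbitrary cutoff for this sketch
        print(f"{org} ~ {host} ({ip})  score={score:.2f}")
```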

Query >> Search: "Federated Facts and Figures"
APBNews join FECInfo
Imagine DNS analogies:
- Show # of live nodes, rolled up by domain
- Show avg ping time, rolled up by domain
- Show best ping time per group, rolled up by domain
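A minimal sketch of the roll-up idea with invented data: map per-host measurements to a domain key, then group and aggregate.

```python
# "Show avg ping time, rolled up by domain" as a tiny group-by/aggregate.
from collections import defaultdict
from statistics import mean

pings = {"a.cs.berkeley.edu": 12.0, "b.cs.berkeley.edu": 20.0, "x.mit.edu": 35.0}

def domain(host: str) -> str:
    return ".".join(host.split(".")[-2:])   # crude roll-up key

groups = defaultdict(list)
for host, ms in pings.items():
    groups[domain(host)].append(ms)

for dom, times in groups.items():
    print(dom, "avg:", mean(times), "best:", min(times), "live nodes:", len(times))
```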

Slippery Slope of "Mapping"
Step 1: Exact-match lookups
Step 2: Simple canonicalization, e.g. toupper()
Step 3: Complex canonicalization, e.g. "nameprep"
Step 4: Join with arbitrary mapping tables
- Centralized "keywords" (RealNames)
- Personalized mapping, "bookmarks"
- Geo-located mapping tables for localization
- Load-balancing mapping via net status queries
- Ad hoc mapping: ad hoc database join queries!
  E.g. "the lightest-loaded machine containing >5 of the Billboard Top Ten songs"
Q: How much of this belongs in DNS??

Slippery Slope of "Mapping"

Input Value          Output Value
-----------          ------------
Cs.Berkeley.EDU
cHeese               CHEESE
Nueva_York_Yanquis   NYYankees
Cheese               www.kraft.com
Dear old Thai Food   www.plearn.com
White Album          Back in the USSR
White Album          Dear Prudence
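A small sketch, under my own assumptions, of how these steps compose: a simple canonicalization pass feeding a (possibly one-to-many) mapping-table lookup. The table contents echo the slide; the missing output for Cs.Berkeley.EDU is left out rather than invented.

```python
# Canonicalization (step 2) composed with an arbitrary mapping-table join (step 4).
MAPPING_TABLE = {
    "CHEESE": ["www.kraft.com"],
    "DEAR OLD THAI FOOD": ["www.plearn.com"],
    "WHITE ALBUM": ["Back in the USSR", "Dear Prudence"],
    "NUEVA_YORK_YANQUIS": ["NYYankees"],
}

def canonicalize(name: str) -> str:
    # Step 2: simple canonicalization (toupper-style)
    return name.strip().upper()

def lookup(name: str) -> list[str]:
    # Step 4: join against a mapping table; note the result can be one-to-many
    key = canonicalize(name)
    return MAPPING_TABLE.get(key, [key])   # fall back to exact-match behavior

print(canonicalize("cHeese"))    # 'CHEESE'
print(lookup("Cheese"))          # ['www.kraft.com']
print(lookup("White Album"))     # ['Back in the USSR', 'Dear Prudence']
```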

Slippery Slope of Querying
Step 1: Single collection, exact lookup
Step 2: Single collection, fancy lookup ("search")
Step 3: Combine multiple collections, e.g. mapping (join), "metasearch" (union)
Step 4: Roll-up/drill-down
Step 5: Statistical analyses
Step 6: "Turing-complete" queries (Prolog/Datalog)
How much of this belongs in DNS??
Note: even databases rejected the last step!

Where Does it End??
All this functionality seems plausibly useful, some of it even necessary!
But what should be in network infrastructure, vs. centralized, replicated services?
Who pays for these services?

Querying 101
Access methods: e.g. B-trees, hash indexes. Used in directories, databases, search engines. Can be distributed.
Bulk data processing: e.g. joins, intersect, union, rank/sort, group/aggregate (statistical summaries). Used in databases and search engines. Parallelized on well-maintained clusters.
Query optimization: e.g. rewriting in canonical form, choosing access methods and processing options. Very different techniques in databases vs. search engines, though not mutually exclusive.
This is core and common to all query systems.

Access Methods
Local search:
- Main-memory data structures: binary trees, hashtables, skip lists, etc.
- Disk-based data structures: B-trees, linear hash indexes, etc.
- Typically equality and/or range lookups
Distributed search:
- Flat partitioning (hash, range), with replication
- Hierarchical partitioning
- More recent multi-hop search & replication, e.g. CAN, Chord, PAST, Tapestry, Pastry
  - Equality lookups only (so far), no need for hierarchies
  - Proposed as a DNS replacement by networking researchers
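For concreteness, a hash-partitioned lookup sketch in the spirit of these systems (my illustration, not CAN/Chord/PAST/Tapestry/Pastry code, and it omits the multi-hop routing that makes them scale): keys and nodes hash onto a ring, and a key's owner is the first node at or after its hash.

```python
# Single-hop consistent-hashing lookup: equality lookups only, no hierarchy.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32

class HashRing:
    def __init__(self, nodes: list[str]):
        self.ring = sorted((h(n), n) for n in nodes)   # (hash, node), sorted

    def owner(self, key: str) -> str:
        """Return the node responsible for key: first node clockwise of h(key)."""
        points = [p for p, _ in self.ring]
        i = bisect.bisect_right(points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("www.example.com"))   # the node that would store this key
```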

Data Processing
Flow the data through processing code
- Absent in directory services
- Used in database systems (in the box): ad hoc combinations of operators possible
- Constrained use in text search engines (in the box):
  - 1 collection: (word, docID, position, score, …), indexed by stemmed word
  - OR, AND, NOT: union/intersect/subtract
  - Map docIDs to URLs, snippets, etc.
  - Sort by a (magic) function of position, score, etc.
  - Text search is one (highly tuned!) database query
Fun research: genericize this dataflow technology
- Goal of Telegraph project: adaptive dataflow (out of the box)
- Cluster-based implementation
- Distributed (P2P) implementation
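A toy sketch of that point, with invented documents: Boolean text search is a short dataflow of set operators over an inverted index, followed by a ranking sort.

```python
# Text search as a tiny dataflow: build posting lists, combine with set
# operators, then sort by a placeholder scoring function.
from collections import defaultdict

docs = {1: "cheese shop sketch", 2: "dead parrot sketch", 3: "cheese and wine"}

index = defaultdict(set)                 # word -> set of docIDs (posting lists)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def AND(*words): return set.intersection(*(index[w] for w in words))
def OR(*words):  return set.union(*(index[w] for w in words))

def rank(doc_ids):
    # "magic" scoring function stub: here, shorter documents score higher
    return sorted(doc_ids, key=lambda d: len(docs[d]))

print(rank(AND("cheese", "sketch")))     # intersect posting lists, then sort
print(rank(OR("cheese", "parrot")))      # union posting lists, then sort
```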

Query Optimization
Text search:
- Query rewrite: stemming, stop words, thesaurus
- Scheduling: which machine(s) on the cluster, based on data partitioning, load & data statistics
Database systems:
- Query rewrite: authorization, "views"
- Choices among (redundant) access methods
- Choices among data processing algorithms (joins)
- Choices of reorderings for these algorithms
- Scheduling: which machine(s) on the cluster, based on data partitioning, load & data statistics
- Lots of fancy tricks here!
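A minimal sketch of the cost-based flavor of this (the table statistics, selectivity, and cost model are all made up): enumerate candidate join orders, estimate each with a crude model, and keep the cheapest.

```python
# Toy cost-based optimizer: pick a join order by estimated intermediate sizes.
from itertools import permutations

card = {"donors": 1_000_000, "whois": 10_000_000, "hosts": 500}  # assumed table sizes
join_selectivity = 0.0001                                         # assumed

def estimate_cost(order):
    """Left-deep pipeline; cost ~ sum of estimated intermediate result sizes."""
    rows, cost = card[order[0]], 0
    for table in order[1:]:
        rows = rows * card[table] * join_selectivity   # estimated join output
        cost += rows
    return cost

plans = {order: estimate_cost(order) for order in permutations(card)}
best = min(plans, key=plans.get)
print("chosen join order:", " -> ".join(best), "estimated cost:", plans[best])
```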

And what about storage semantics??
So far we only looked at the query side
Query results are only as good as the data!
Again, varying solutions here…

Replication & Data Consistency
Databases do transactions:
- Atomic, durable updates across multiple records; data consistency guaranteed
- Distributed transactions possible, but slow (Two-Phase Commit)
- Most people do "warm" replication: log-shipping with transactional networking (MQ)
- Heavyweight technology! Brewer's CAP "theorem"
Directories tend to use "leases" (TTL):
- Tend to be per record or collection
- Cross-object consistency not guaranteed; a little drift is often OK
- In a scenario with mapping, may need to think about atomicity across records/tables
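A minimal sketch of the lease/TTL approach with a hypothetical cache API: reads are served from a possibly stale copy until the lease expires, and nothing guarantees atomicity across records.

```python
# TTL lease on cached records: staleness is tolerated until expiry.
import time

class LeasedCache:
    def __init__(self):
        self.entries = {}            # key -> (value, expiry_time)

    def put(self, key, value, ttl_seconds):
        self.entries[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        value, expiry = self.entries.get(key, (None, 0))
        if time.time() < expiry:
            return value             # possibly stale, but within the lease
        return None                  # lease expired: must re-fetch from authority

cache = LeasedCache()
cache.put("www.example.com", "192.0.2.10", ttl_seconds=300)
print(cache.get("www.example.com"))  # served from cache until the TTL runs out
```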

One View: Core vs. Apps
Ensure that core services:
- Scale to a large # of machines (~size of the Internet)
- Work in the presence of failures
- Give well-defined standard results: no surprises/ambiguity! E.g. SQL subsets, Boolean text search
  - Need not be a user in the loop! Unanticipated scenarios in the future
- Support ad hoc queries -- think cross-paradigm
Apps:
- Scale to clusters, need not scale to the size of the Internet
- Allow for mapping/customization
- Results can be preference-dependent; what about time-/geo-dependent??
- Allow result browsing, analysis
  - Fuzzy results, roll-ups, summaries all OK: user in the loop!
- Impose a query paradigm

More? Telegraph: Federated Facts and Figures: