
1 Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?
Surajit Chaudhuri (Microsoft Research), Raghu Ramakrishnan (U Wisconsin, ex QUIQ), Gerhard Weikum (Max-Planck Institute of CS)
Warning: Non-technical Content! To Be Taken with a Grain of SALT.

2 DB and IR: Two Parallel Universes

                         Database Systems                     Information Retrieval
  canonical application: accounting                           libraries
  data type:             numbers, short strings               text
  foundation:            algebraic / logic based              probabilistic / statistics based
  search paradigm:       Boolean retrieval                    ranked retrieval
                         (exact queries, result sets/bags)    (vague queries, result lists)

Parallel universes forever?

3 Take-home Message or Food for Disagreement

Claim 1: DB&IR applications require and justify a new platform / kernel system with an appropriately designed API for a Scoring Algebra for Lists and Text (SALT).
Claim 2: One key challenge lies in reconciling flexible scoring with query optimizability.

4 Outline
- Top-down Motivation: DB&IR Applications
- Bottom-up Motivation: Algorithms & Tricks
- Towards SALT: Scoring Algebra(s) for Lists and Text
- Key Problem: Query Optimization

5 Top-down Motivation: Applications (1) - Customer Support

Typical data:
- Customers (CId, Name, Address, Area, Category, Priority, ...)
- Requests (RId, CId, Date, Product, ProblemType, Body, RPriority, WFId, ...)
- Answers (AId, RId, Date, Class, Body, WFId, WFStatus, ...)

Typical queries: a premium customer from Germany writes: "A notebook, model ..., configured with ..., has a problem with the driver of its Wave-LAN card. I already tried the fix ..., but received error message ..." → request classification & routing, find similar requests.

Platform desiderata (from the app developer's viewpoint):
- Flexible ranking and scoring on text, categorical, and numerical attributes
- Support for high update rates concurrently with high query load
- Incorporation of dimension hierarchies for products, locations, etc.
- Efficient execution of complex queries over text and data attributes

Why customizable scoring?
- wealth of different apps within this app class
- different customer classes
- adjustment to evolving business needs
- scoring on text + structured data (weighted sums, language models, skyline, with correlations, etc.)

6 Top-down Motivation: Applications (2)

More application classes:
- Global health-care management for monitoring epidemics
- News archives for journalists, press agencies, etc.
- Product catalogs for houses, cars, vacation places, etc.
- Customer relationship management in banks, insurance, telecom, etc.
- Bulletin boards for social communities
- P2P personalized & collaborative Web search
- etc.

7 Top-down Motivation: Applications (3)

Next wave Text2Data: use Information-Extraction technology (regular expressions, HMMs, lexicons, other NLP and ML techniques) to convert text docs into relational facts, moving up in the value chain.

Example: "The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7, and is organized by D.J. DeWitt, Mike Stonebreaker, ..."

  Conference (Name, Year, Location, Date, Prob):
    CIDR | 2005 | Asilomar | 05/01/04 | 0.95

  ConfOrganization (Name, Year, Chair, Prob):
    CIDR | 2005 | P68 | 0.9
    CIDR | 2005 | P35 | 0.75

  People (Id, Name):
    P35 | Michael Stonebraker
    P68 | David J. DeWitt

Facts now have confidence scores; queries involve probabilistic inferences and result ranking; relevant for "business intelligence".
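To make the Text2Data step concrete, here is a minimal Python sketch of fact extraction from the example sentence. The regex pattern, field names, and the 0.95 confidence are illustrative stand-ins for a real extractor (HMMs/CRFs, lexicons, etc.), not the paper's method:

```python
import re

# Toy Text2Data sketch: a hand-written pattern plays the role of the
# information extractor; real systems learn such extractors instead.
TEXT = ("The CIDR'05 conference takes place in Asilomar "
        "from Jan 4 to Jan 7, and is organized by D.J. DeWitt, Mike Stonebreaker, ...")

pattern = re.compile(
    r"The (?P<name>\w+)'(?P<year>\d{2}) conference takes place in "
    r"(?P<location>\w+) from (?P<date>\w+ \d+)")

m = pattern.search(TEXT)
if m:
    # Each extracted fact carries a confidence score; here it is a
    # made-up placeholder rather than a learned probability.
    fact = {
        "Name": m.group("name"),
        "Year": 2000 + int(m.group("year")),
        "Location": m.group("location"),
        "Date": m.group("date"),
        "Prob": 0.95,  # placeholder extractor confidence
    }
    print(fact)  # {'Name': 'CIDR', 'Year': 2005, 'Location': 'Asilomar', ...}
```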

8 Top-down Motivation: Applications (4)

Essential requirements for a DB&IR platform:
1) Customizable scoring and ranking
2) Composite queries incl. joins, filters & top-k
3) Optimizability of query expressions
4) Metadata and ontologies
5) Simple, sufficiently expressive data model (XML light)
6) Data preparation (entity recognition, entity resolution, etc.)
7) Personalization (profile learning)
8) Usage patterns (query logs, click streams, etc.)

Requirements 1, 2, and 3 most strongly affect the platform architecture and API.

9 Bottom-up Motivation: Algorithms & Tricks

Inverted index: a B+ tree on terms, categories, values, ...; each term points to an index list of (ID, s = tf*idf) entries sorted by ID. [Figure: index lists for terms t1, t2, t3 with (docID: score) entries.]

Google scale: > 10 million terms, > 8 billion docs, > 4 TB index.

Vanilla algorithm "join & sort" for query q: t1 t2 t3:
  top-k( σ[term=t1](index) ⋈_ID σ[term=t2](index) ⋈_ID σ[term=t3](index)
         order by sum(s) desc )

Good search engines use a variety of heuristics and tricks for shortcutting:
- keeping short lists of the best docs per term in memory
- global statistics for index-list selection
- early pruning of result candidates
- bounded priority queue of candidates
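A minimal Python sketch of the vanilla "join & sort" plan, assuming small in-memory index lists; the terms, doc IDs, and scores below are made up for illustration:

```python
# Vanilla "join & sort" top-k: intersect the index lists on docID,
# aggregate scores by sum, then sort. Index contents are illustrative.
index = {
    "t1": {17: 0.3, 44: 0.4, 52: 0.7},   # term -> {docID: tf*idf score}
    "t2": {11: 0.6, 17: 0.1, 52: 0.3},
    "t3": {17: 0.1, 28: 0.1, 52: 0.1},
}

def join_and_sort_topk(query_terms, k):
    lists = [index[t] for t in query_terms]
    # Join on ID: keep only docs that appear in every term's list.
    common = set(lists[0]).intersection(*lists[1:])
    # Score by monotonic aggregation (here: sum), then sort descending.
    scored = [(sum(l[d] for l in lists), d) for d in common]
    scored.sort(reverse=True)
    return scored[:k]

print(join_and_sort_topk(["t1", "t2", "t3"], k=2))
# -> [(1.1, 52), (0.5, 17)]
```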

10 Bottom-up Motivation: Algorithms & Tricks

Data items d1, ..., dn; query q = (t1, t2, t3); per-term index lists sorted by descending score s(ti, d):
  t1: d78: 0.9, d23: 0.8, d10: 0.8, d1: 0.7, d88: 0.2, ...
  t2: d64: 0.8, d23: 0.6, d10: 0.6, ..., d78: 0.1, ...
  t3: d10: 0.7, d78: 0.5, d64: 0.4, d99: 0.2, d34: 0.1, ...

TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01):
  scan index lists; consider d at position i in list Li;
  E(d) := E(d) ∪ {i}; high_i := s(ti, d);
  worstscore(d) := aggr{ s(tν, d) | ν ∈ E(d) };
  bestscore(d) := aggr{ worstscore(d), aggr{ high_ν | ν ∉ E(d) } };
  if worstscore(d) > min-k then
    add d to top-k; min-k := min{ worstscore(d') | d' ∈ top-k };
  else if bestscore(d) > min-k then cand := cand ∪ {d};
  threshold := max{ bestscore(d') | d' ∈ cand };
  if threshold ≤ min-k then exit;

Example run for k = 1, showing [worstscore, bestscore] per candidate:
  scan depth 1: d78 [0.9, 2.4], d64 [0.8, 2.4], d10 [0.7, 2.4]
  scan depth 2: d78 [1.4, 2.0], d23 [1.4, 1.9], d64 [0.8, 2.1], d10 [0.7, 2.1]
  scan depth 3: d10 [2.1, 2.1], d78 [1.4, 2.0], d23 [1.4, 1.8], d64 [1.2, 2.0] → STOP!

TA: efficient & principled top-k query processing with monotonic score aggregation.
- A TA flavor with early termination is great
- Implementation details are crucial
- DB&IR needs to combine it with filter, join, phrase matching, etc.
- Unclear how to abstract TA and integrate it into relational algebra
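A compact, hedged Python rendering of the NRA loop above (sorted access only, aggregation = sum); the function shape and data layout are my own, and the toy lists echo the slide's example:

```python
def nra(lists, k):
    """NRA sketch (sorted access only), monotonic aggregation = sum.
    lists: {term: [(doc, score), ...]} sorted by descending score."""
    terms = list(lists)
    seen = {}                                   # doc -> {term: score}, i.e. E(d)
    high = {t: lists[t][0][1] for t in terms}   # score bound at scan position
    depth = 0
    while True:
        for t in terms:                          # one round-robin step per list
            if depth < len(lists[t]):
                doc, s = lists[t][depth]
                seen.setdefault(doc, {})[t] = s
                high[t] = s                      # unseen docs in t score <= s
        depth += 1
        worst = {d: sum(ts.values()) for d, ts in seen.items()}
        best = {d: worst[d] + sum(high[t] for t in terms if t not in ts)
                for d, ts in seen.items()}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = min(worst[d] for d in topk)
        threshold = max((best[d] for d in seen if d not in topk), default=0.0)
        # Exit when no candidate can still beat the current top-k,
        # or when every list is exhausted.
        if threshold <= min_k or depth >= max(map(len, lists.values())):
            return [(d, worst[d]) for d in topk], depth

lists = {   # toy lists echoing the slide's example (scores are illustrative)
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2)],
}
print(nra(lists, k=1))   # top-1 is d10 with score ~2.1, stopping at depth 3
```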

11 SALT Algebra: Three Proposals

Related prior work: probabilistic relations, approximate query processing, query algebras on lists, SQL user-defined aggregation.

Goals:
- reconcile relational algebra with TA-flavor operators
- reconcile flexible scoring with query optimizability

SALT = Scoring Algebra for Lists and Text

Three proposals:
- Speculative filters and stretchable operators
- Operators with scoring modalities
- Scoring operator ς

12 Speculative Filters and Stretchable Operators (SALT with SQL Flavor)

Rationale: map ranked-retrieval queries to multidimensional SQL filters such that they return approximately k results.

Ex.: recent WLAN device driver problems on notebook T40 (with Debian) →
  σ[date > 11/30/04 ∧ class="/network/drivers" ∧ product="Thinkpad" ∧ software="Linux"] (Requests)

Techniques:
- ranking many answers → speculative filters: generate additional conjunctive conditions to approximate top-k
- finding enough answers → stretchable operators: relax (range or categorical) conditions to ensure at-least-k; similar to IR query expansion by (pseudo-)feedback, thesaurus, query log (see the sketch below)

Proposal: σ~[k, date > 1/4/05 ∧ class="/network/drivers/wlan" ∧ product="T40"] (...)
generally: σ~[k], ⋈~[k], ∪~[k], ...

Properties and problems:
+ can leverage multidimensional histograms
? composability of operators
? choice of filters for approximately k top-level results
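As a concrete illustration of a stretchable operator, here is a Python sketch of σ~[k, date > cutoff] under assumed relaxation semantics: widen the date range stepwise until at least k tuples qualify. The table, step size, and relaxation bound are invented for the example:

```python
from datetime import date, timedelta

# Toy Requests table; fields and values are illustrative.
requests = [
    {"rid": 1, "date": date(2005, 1, 5), "product": "T40"},
    {"rid": 2, "date": date(2004, 12, 20), "product": "T40"},
    {"rid": 3, "date": date(2004, 11, 2), "product": "T40"},
    {"rid": 4, "date": date(2005, 1, 3), "product": "T30"},
]

def stretch_select(rows, k, cutoff, step=timedelta(days=30), max_relax=6):
    """Stretchable selection sigma~[k, date > cutoff]: relax the range
    condition until at least k rows qualify (or we give up relaxing)."""
    for _ in range(max_relax + 1):
        result = [r for r in rows if r["date"] > cutoff]
        if len(result) >= k:
            return result, cutoff
        cutoff -= step   # relax: widen the date range by one step
    return result, cutoff

rows, used_cutoff = stretch_select(requests, k=3, cutoff=date(2005, 1, 4))
print(len(rows), used_cutoff)   # 3 rows after relaxing the cutoff to 2004-12-05
```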

13 ς Operator (SALT with TA Flavor)

Rationale:
- all operators produce lists of tuples
- a ς operator encapsulates customizable scoring
- ς can be efficiently implemented in a relational kernel

Technique: ς[Φ; λ, F; T](R) consumes prefixes of an input list R, with
- a set Φ of simple aggregation functions, each with O(1) space and O(|prefix|) time ("accumulators")
- a scoring function λ: dom(R) × out(Φ) → real
- a filter condition F as in σ, referring to the current tuple and Φ values
- a stopping condition T, of the same form as F

Similar to SQL rank() with user-defined aggregation (and LDL++ aggregation), but with early termination!

Ex.:
  sort[k, Score, desc] (
    ς[ Φ: min-k := min{ Score(t) | t ∈ input }; threshold := ...;
       λ(t) := sum(R1.Score, R2.Score, C1.Score) as Score;
       F: Score > min-k ∨ |input| < k;
       T: min-k ≥ threshold ∧ |input| ≥ k ]
    ( merge( sort[...](ς[...](Requests R1 ...)),
             sort[...](ς[...](Requests R2 ...)),
             sort[...](ς[...](Customers C1 ...)) ) ) )

Properties and problems:
+ pipelined processing of list prefixes
+ can be implemented by TA with bounded queue
? difficult to integrate into query rewriting
? difficult for cost estimation
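A hedged Python rendering of the ς contract: stream the input list, maintain O(1)-space accumulators (Φ), score each tuple (λ), filter (F), and stop early (T). The names scoring_op, accumulators, etc. are mine, not an API from the paper:

```python
def scoring_op(stream, accumulators, score, filt, stop):
    """Sketch of the scoring operator sigma[Phi; lambda, F; T](R):
    consumes a prefix of the input list with early termination.
    accumulators: {name: (initial_value, update_fn)} -- the Phi part."""
    state = {name: init for name, (init, _) in accumulators.items()}
    out = []
    for t in stream:                      # pipelined over the list prefix
        s = score(t, state)               # lambda: (tuple, Phi values) -> score
        for name, (_, update) in accumulators.items():
            state[name] = update(state[name], t, s)   # O(1)-space update
        if filt(t, s, state, out):        # F: keep the current tuple?
            out.append((s, t))
        if stop(state, out):              # T: stopping condition holds -> exit
            break
    return out

# Usage: top-k over a merged, score-ordered stream (toy data, k = 2).
stream = [("d7", 0.9), ("d3", 0.8), ("d1", 0.5), ("d9", 0.4)]
k = 2
res = scoring_op(
    stream,
    accumulators={"seen": (0, lambda v, t, s: v + 1)},  # count consumed tuples
    score=lambda t, state: t[1],                        # score = stored value
    filt=lambda t, s, state, out: len(out) < k,         # F: still need results
    stop=lambda state, out: state["seen"] >= k,         # T: prefix long enough
)
print(res)   # [(0.9, ('d7', 0.9)), (0.8, ('d3', 0.8))]
```

The point of the design is that F and T see only the current tuple plus constant-size accumulator state, which is what makes pipelined, prefix-bounded execution possible.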

14 Key Problem: Query Rewriting

Goal: establish algebraic equivalences for SALT expressions as a basis for query rewriting.

Examples:
- commutativity of stretchable top-k and standard selection:
  σ~[k, date > 1/4/05] (σ[product="T40"] (R)) ≡ σ[product="T40"] (σ~[k, date > 1/4/05] (R))
- commutativity of scoring operator ς and standard selection
- distributivity of scoring operator ς over union
- ...
Wishful thinking!

Technical challenge: either work out correct & useful rewriting rules, or establish "approximate equivalences" of the kind
  σ~[k, F] (σ[G] (R)) ≈ sort[k, ...] (σ[G] (σ~[k*, F] (R))) with a proper k*,
ideally with quantifiable error probabilities.
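A tiny experiment (my own illustration, not from the paper) of why such equivalences are "wishful thinking": naively pushing a stretchable top-k below a selection loses answers unless k is inflated to some k*:

```python
# Toy check: sigma~[k,F](sigma[G](R)) vs sort[k](sigma[G](sigma~[k*,F](R))).
# R holds (product, recency) rows; higher recency = more recent request.
R = [("T40", 9), ("T30", 8), ("T40", 7), ("T30", 6), ("T40", 5)]

def topk_recent(rows, k):            # stand-in for the stretchable top-k on date
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]

def select_t40(rows):                # standard selection sigma[product="T40"]
    return [r for r in rows if r[0] == "T40"]

k = 2
plan_a = topk_recent(select_t40(R), k)           # [('T40', 9), ('T40', 7)]
plan_b = select_t40(topk_recent(R, k))           # [('T40', 9)] -- lost a result!
plan_c = select_t40(topk_recent(R, 2 * k))[:k]   # inflate to k* = 2k, refilter
print(plan_a, plan_b, plan_c)   # plan_b != plan_a; plan_c recovers it here
```

Choosing k* so the rewritten plan matches with high probability is exactly the open problem the slide names.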

15 Key Problem: Cost Estimation

1) usual DB cost estimation: selectivity of multidimensional filters
2) cost estimation for top-k ranked retrieval: when will we stop? (for ς: length of the input prefix; for TA: scan depth on the index lists)

We claim that 2) is harder than 1)!

[Figure: index lists for t1, t2, t3 with (docID: score) entries, as on slide 10.]

Technical challenge: develop a full estimator for top-k execution cost.

Possible approaches (Ilyas et al.: SIGMOD'04, Theobald et al.: VLDB'04):
- Probabilistically predict (a quantile of) the aggregated score of data item d:
  precompute a score-distribution histogram for each single dimension;
  compute the convolution of the histograms at query time to predict P[∑i Si ≥ δ]
- View the scores X1 > X2 > ... > Xn of the n data items as samples from S = ∑i Si;
  use order statistics to predict the score of the rank-k item and the scan depth at stopping time
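A small numpy sketch of the histogram-convolution idea: per-dimension score histograms on [0, 1] are convolved to approximate the distribution of S = ∑i Si and hence P[S ≥ δ]. The bucket probabilities are invented, and independence across dimensions is assumed, as in simple estimators:

```python
import numpy as np

# Per-dimension score histograms over [0, 1], 10 equi-width buckets each.
# Bucket probabilities are made up for illustration (each sums to 1).
bins = 10
h1 = np.array([0.3, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02])
h2 = np.array([0.25, 0.2, 0.15, 0.1, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03])
h3 = np.array([0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01])

# Convolving the histograms approximates the distribution of S = S1+S2+S3,
# assuming independent dimensions. Support is [0, 3] in 28 coarse buckets.
s_dist = np.convolve(np.convolve(h1, h2), h3)
bucket_width = 1.0 / bins

def prob_sum_at_least(delta):
    # Coarse tail estimate of P[S >= delta] from the convolved histogram.
    first = int(round(delta / bucket_width))   # bucket where the tail starts
    return s_dist[first:].sum()

print(prob_sum_at_least(2.1))   # estimated fraction of items scoring >= 2.1
```

Such a tail estimate is what lets the optimizer predict how deep TA must scan before the stopping threshold falls below the k-th best score.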

16 Conclusion: Caveats and Rebuttals

- Is there anything new here? The literature has bits and pieces, but no strategic view; DB&IR is important, and the SALT algebra is one key aspect.
- Is there business value in DB&IR? Yes, for both individual apps and general text2data.
- Where do we go from here? Detailed design & implementation of SALT, with query optimization.
- Don't eXtensible DBSs or intranet SEs cover 90%? XDBSs with UDFs are too complex; SEs lack query optimization.
- Do IR people believe in DB&IR? Yes: probabilistic Datalog, XML IR, statistical relational learning, etc.
- Do IR people believe in SALT and query optimization? No: they are mostly driven by search result quality and largely disregard performance.
- Does the SE industry believe in SALT and query optimization? No: simple consumer-oriented search or small content-management apps.

