Structured-Value Ranking in Update-Intensive Relational Databases
Jayavel Shanmugasundaram, Cornell University
(Joint work with: Lin Guo, Kevin Beyer, Eugene Shekita)

Case Study: Internet Archive

Internet Archive Database
Movies table:
  Mid   Name              Description
  10    Amateur Film      … they stand on the golden gate bridge and …
  20    American Thrift   … golden gate bridge with statue of liberty …
  …     …                 …
Query:
  SELECT * FROM Movies M
  ORDER BY score(M.description, 'golden gate')
  FETCH TOP 10 RESULTS ONLY

Main Issue
Traditional IR ranking methods would rank the two movies about the same
Example: TF-IDF
  – “Golden Gate” appears exactly once in both descriptions
  – The lengths of the text fields are about the same
  – Hence: same normalized TF-IDF score
Larger issue: traditional IR scoring methods were developed for stand-alone document collections
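
The arithmetic behind this slide can be sketched as follows. This is a minimal illustration, assuming one common TF-IDF variant (length-normalized term frequency times a smoothed log IDF); the talk does not say which variant it has in mind, and the lengths and collection statistics below are hypothetical.

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """Length-normalized term frequency times a smoothed inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(1 + num_docs / docs_with_term)
    return tf * idf

# Hypothetical numbers: the phrase occurs once in each description,
# and the two descriptions are about the same length.
amateur_film = tf_idf(term_count=1, doc_length=40, num_docs=1000, docs_with_term=25)
american_thrift = tf_idf(term_count=1, doc_length=42, num_docs=1000, docs_with_term=25)

print(amateur_film, american_thrift)
```

With equal term counts and near-equal lengths, the two scores differ only by the small length ratio, so text statistics alone give no basis for preferring one movie over the other.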

Internet Archive Database: Structured Value Ranking (SVR)
Movies table:
  Mid   Name              Description
  10    Amateur Film      … they stand on the golden gate bridge and …
  20    American Thrift   … golden gate bridge with statue of liberty …
  …     …                 …
Reviews table:
  Rid   Mid   Name       Rating
  901   10    bleblanc   2
  902   20    cooker     4
  903   10    harry      1
  904   20    alice      5
  …     …     …          …
Statistics table:
  Sid   Mid   Visits   Downloads
  81    10    285      90
  82    20    927      247
  …     …     …        …

Structured Value Ranking
Use structured data values associated with text columns to score results
Main technical challenge
  – Structured data values (and hence scores) change frequently and possibly dramatically!
      Number of visits, downloads, award announcements
      “SlashDot effect”
      Bursts and rapidly changing popularity [Kleinberg]
  – Users still want to see results ordered by latest score values

Dealing with Score Updates
Traditional top-k algorithms: order inverted lists by score
  – Top-k queries answered efficiently by scanning only top part of inverted list
Not efficient if scores are updated
  – Need to reorder inverted lists
Solution:
  – New family of inverted lists that are maintained in approximate score order
  – Correct for approximation during query processing

Summary of Proposed Techniques
SQL-based technique for specifying SVR in a relational database
New family of inverted lists that are robust to score updates, while still efficient for queries
  – Can specify update-query tradeoff
  – Combination of SVR and TF-IDF scores
Can be implemented using existing relational technology such as B+-trees

Outline
System Architecture
Indexing and Query Processing
Experimental Evaluation
Related Work
Conclusion

System Architecture (diagram)
Components: RDBMS; Relational Query Engine; Relational Tables and Indices; Text Management Component; Text Query Engine
Flow labels: Create Text Index; SQL/MM; Keyword Query; Relational Sub-query; Results; Results & scores
Annotations: SQL Specification of SVR Scores; Materialized Views for SVR Scores; Novel Indices using B+-trees

SQL-Based SVR Specification
create function S1 (id: integer) returns float
  return SELECT Avg(R.rating) FROM Reviews R WHERE R.Mid = id
create function S2 (id: integer) returns float
  return SELECT S.Visits FROM Statistics S WHERE S.Mid = id
create function S3 (id: integer) returns float
  return SELECT S.Downloads FROM Statistics S WHERE S.Mid = id
create function Agg (s1, s2, s3: float) returns float
  return (s1*100 + s2/2 + s3)
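
To make the specification concrete, the sketch below evaluates S1, S2, S3 and Agg in Python over the example rows. The row values are one reading of the flattened slide tables and should be treated as assumptions; the Python names (`reviews`, `statistics_rows`, `svr_score`) are hypothetical and only mirror the SQL functions above.

```python
from statistics import mean

# Example rows as read from the slide tables (assumed values).
reviews = [  # (Rid, Mid, Name, Rating)
    (901, 10, "bleblanc", 2),
    (902, 20, "cooker", 4),
    (903, 10, "harry", 1),
    (904, 20, "alice", 5),
]
statistics_rows = {  # Mid -> (Visits, Downloads)
    10: (285, 90),
    20: (927, 247),
}

def s1(mid):
    """Average review rating of the movie (mirrors S1)."""
    return mean(rating for _, m, _, rating in reviews if m == mid)

def s2(mid):
    """Number of visits (mirrors S2)."""
    return statistics_rows[mid][0]

def s3(mid):
    """Number of downloads (mirrors S3)."""
    return statistics_rows[mid][1]

def agg(a, b, c):
    """Weighted combination from the slide: s1*100 + s2/2 + s3."""
    return a * 100 + b / 2 + c

def svr_score(mid):
    return agg(s1(mid), s2(mid), s3(mid))

print(svr_score(10), svr_score(20))
```

Under these assumed rows, the two movies that tie on text statistics get clearly different SVR scores, which is the point of ranking by structured values.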

SQL-Based SVR Specification (combined with TF-IDF)
create function S1 (id: integer) returns float
  return SELECT Avg(R.rating) FROM Reviews R WHERE R.Mid = id
create function S2 (id: integer) returns float
  return SELECT S.Visits FROM Statistics S WHERE S.Mid = id
create function S3 (id: integer) returns float
  return SELECT S.Downloads FROM Statistics S WHERE S.Mid = id
create function Agg (s1, s2, s3, s4: float) returns float
  return (s1*100 + s2/2 + s3 + s4/2)
(s4 = TFIDF())

Efficiently Maintaining SVR Scores
One of the key challenges: SVR scores can change frequently
Solution: use materialized views
  – Leverage relational technology
  – Benefit of SQL-based SVR specification
create materialized view Score as
  SELECT Agg(S1(M.Mid), S2(M.Mid), S3(M.Mid))
  FROM Movies M

Index Operations
Document score updates
  – Handle frequent updates to scores
Top-k keyword queries
  – Conjunctive and disjunctive keyword queries
  – Include IR-style (TF-IDF) scores
  – Top-k query results
Content updates, insertions and deletions
  – Update to document content
  – Document insertions and deletions

Naïve Approach 1: ID Method
Inverted lists (ordered by Id):
  golden: 10, 12, 18, 21, 34, …
  gate:   11, 13, 18, 34, 39, …
Score table:
  Id   Score
  1    70.85
  2    91.86
  3    12.34
  …    …
Score updates: efficient (just update score table)
Top-k queries: inefficient (scan all of inverted list)
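
A minimal sketch of the ID Method, assuming in-memory Python structures in place of the real index. The postings follow the slide, while the scores for those document Ids are hypothetical.

```python
import heapq

# Inverted lists ordered by document Id (postings as on the slide).
inverted = {
    "golden": [10, 12, 18, 21, 34],
    "gate": [11, 13, 18, 34, 39],
}
# Separate score table: Id -> current score (hypothetical values).
score = {10: 70.85, 11: 12.34, 12: 90.19, 13: 55.0, 18: 91.86,
         21: 40.0, 34: 60.5, 39: 33.3}

def update_score(doc_id, new_score):
    """A score update touches only the score table, never the inverted lists."""
    score[doc_id] = new_score

def top_k(terms, k):
    """Conjunctive query: intersect the lists (every posting is read), then rank by score."""
    lists = [inverted[t] for t in terms]
    matches = set(lists[0]).intersection(*lists[1:])
    return heapq.nlargest(k, matches, key=lambda d: score[d])

print(top_k(["golden", "gate"], 1))
```

The update is a single dictionary write, but `top_k` cannot stop early because Id order says nothing about score order.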

Naïve Approach 2: Score Method
Inverted lists (ordered by Score), entries are (Id, Score):
  golden: (156, 98.32), (12, 90.19), (89, 79.52), (54, 77.79), …
  gate:   (176, 97.19), (12, 90.19), (64, 89.55), (4, 84.63), …
Top-k queries: efficient (top part of inverted list)
Score updates: inefficient (reorganize many lists)
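
A minimal sketch of the Score Method, assuming sorted Python lists stand in for the on-disk inverted lists. The class and function names are hypothetical; the postings are the ones shown on the slide.

```python
import bisect

class ScoreOrderedList:
    """Inverted list kept in descending score order; each posting stores its score."""

    def __init__(self, postings):
        # Stored as (-score, doc_id) so that ascending sort means descending score.
        self.entries = sorted((-s, d) for d, s in postings)

    def top_k(self, k):
        """Single-keyword top-k: read only the first k postings."""
        return [(d, -neg) for neg, d in self.entries[:k]]

    def update(self, doc_id, old_score, new_score):
        """Reposition one posting; returns True if this list contained the document."""
        i = bisect.bisect_left(self.entries, (-old_score, doc_id))
        if i == len(self.entries) or self.entries[i] != (-old_score, doc_id):
            return False
        del self.entries[i]
        bisect.insort(self.entries, (-new_score, doc_id))
        return True

golden = ScoreOrderedList([(156, 98.32), (12, 90.19), (89, 79.52), (54, 77.79)])
gate = ScoreOrderedList([(176, 97.19), (12, 90.19), (64, 89.55), (4, 84.63)])

def update_score(doc_id, old_score, new_score):
    """A score change must be applied to every list that contains the document."""
    return sum(lst.update(doc_id, old_score, new_score) for lst in (golden, gate))

print(golden.top_k(2))
```

Queries read a short prefix, but one score change rewrites a posting in every list where the document appears, which is the cost the later methods try to avoid.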

Dilemma
Want inverted lists ordered by score
  – For top-k query performance
  – Like in Score Method
But do not want to touch inverted lists for every score update
  – For score update performance
  – Like in ID Method
How can we address this apparent dilemma?

Score-Threshold Method
Extends Score Method in two key aspects
1) Allow inverted list scores to be out-of-date by up to a threshold
  – Avoids having to frequently update inverted list
      Better score update performance
  – Need to scan more of inverted list (by up to a threshold) to correct for out-of-date score
      Slightly reduced query performance
2) Use “short” inverted list for scores that exceed threshold
  – More efficient than updating large inverted list

Score-Threshold Method: Example (Threshold = 10)
Initial state
  – Long inverted lists (ordered by Score), entries are (Id, Score):
      golden: (156, 98.32), (12, 90.19), (89, 79.52), …
      gate:   (176, 97.19), (12, 90.19), (64, 89.55), …
  – Short list: no entries yet
  – Score Table (Id, Score): (1, 70.85), …, (12, 90.19), …
  – ListScore Table (Id, Score, InShortList)
Doc 12 new score: 95
  – Score Table: (12, 95.00)
  – ListScore Table: (12, 90.19, false)
  – 95 does not exceed the list score 90.19 by more than the threshold, so the inverted lists are not touched
Doc 12 new score: 105
  – Score Table: (12, 105.0)
  – 105 exceeds the list score 90.19 by more than the threshold, so (12, 105.0) is added to the short lists of golden and gate
  – ListScore Table: (12, 105.0, true)
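
The example can be sketched in code as follows. This is a minimal illustration rather than the authors' implementation: it assumes an additive threshold of 10 as in the example, handles single-keyword queries only, moves a document to the short list only when its score rises past the threshold, and uses hypothetical names throughout (`ScoreThresholdIndex`, `update_score`, `top_k`).

```python
import heapq

THRESHOLD = 10.0  # additive threshold, as in the example (assumption)

class ScoreThresholdIndex:
    def __init__(self, postings, scores):
        # Long lists: term -> [(list_score, doc_id)], descending by list score.
        self.long = {t: sorted(((scores[d], d) for d in docs), reverse=True)
                     for t, docs in postings.items()}
        self.short = {t: [] for t in postings}      # small, updatable lists
        self.score = dict(scores)                   # Score table (current scores)
        self.list_score = dict(scores)              # ListScore table
        self.in_short = {d: False for d in scores}  # InShortList flag

    def update_score(self, doc_id, new_score):
        """Always update the Score table; touch lists only past the threshold."""
        self.score[doc_id] = new_score
        if new_score <= self.list_score[doc_id] + THRESHOLD:
            return False
        for term, entries in self.long.items():
            if any(d == doc_id for _, d in entries):
                kept = [e for e in self.short[term] if e[1] != doc_id]
                self.short[term] = sorted(kept + [(new_score, doc_id)], reverse=True)
        self.list_score[doc_id] = new_score
        self.in_short[doc_id] = True
        return True

    def top_k(self, term, k):
        """Scan by descending list score; stop once no unseen document can qualify."""
        # Stale long-list postings of documents now in the short list are skipped lazily.
        live_long = (e for e in self.long[term] if not self.in_short[e[1]])
        best = []  # min-heap of (current score, doc_id)
        for list_score, doc_id in heapq.merge(self.short[term], live_long, reverse=True):
            # Every unseen document has current score <= list_score + THRESHOLD.
            if len(best) == k and list_score + THRESHOLD <= best[0][0]:
                break
            heapq.heappush(best, (self.score[doc_id], doc_id))
            if len(best) > k:
                heapq.heappop(best)
        return sorted(best, reverse=True)

postings = {"golden": [156, 12, 89], "gate": [176, 12, 64]}
scores = {156: 98.32, 12: 90.19, 89: 79.52, 176: 97.19, 64: 89.55}
idx = ScoreThresholdIndex(postings, scores)
```

Calling `idx.update_score(12, 95.0)` leaves all lists untouched, while `idx.update_score(12, 105.0)` adds document 12 to both short lists. The early-termination test is where the query pays for the stale list scores: it must keep scanning until the list score plus the threshold falls to the current k-th best score.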

Query-Update Tradeoff
Choice of threshold function
If threshold(score) = score
  – Every update results in update to inverted list
  – Similar to Score Method
If threshold(score) = infinity
  – No inverted list update, but scan all of list
  – Similar to ID Method
Can control query-update tradeoff using threshold function
  – threshold(score) = r * score, r >= 1
  – r: threshold ratio
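
The effect of the threshold ratio on update cost can be sketched with a small count. This assumes, as a simplification, that an inverted-list update is triggered only when a new score exceeds threshold(list score) = r * list score; the score stream is hypothetical.

```python
def count_list_updates(initial_score, new_scores, r):
    """Count how many score updates force an inverted-list update for ratio r."""
    list_score = initial_score
    touched = 0
    for s in new_scores:
        if s > r * list_score:  # exceeds threshold(list_score) = r * list_score
            list_score = s      # the list entry is refreshed with the new score
            touched += 1
    return touched

updates = [92.0, 95.0, 105.0, 120.0, 130.0, 150.0]  # hypothetical rising scores

for r in (1.0, 1.2, float("inf")):
    print(r, count_list_updates(90.19, updates, r))
```

For this stream, r = 1 touches the list on every update (Score Method behavior), r = infinity never touches it (ID Method behavior), and intermediate ratios fall in between, at the price of scanning further during queries.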

Score-Threshold Method: Critique
Provides good update-query tradeoff
But! Requires score to be stored in inverted list
  – Increases size of inverted list
  – Decreases query performance
Can we avoid storing scores in inverted list and still get update-query tradeoff?

Chunk Method
Main idea: divide document collection into “chunks” based on original document score
  – Lowest 5000 documents in first chunk
  – Next higher 3000 documents in second chunk
  – Next higher 4000 documents in third chunk
  – …
Organize inverted list by chunk, but order documents by Id within a chunk
  – Ordered approximately by score (chunk) like Score Method
  – Avoids storing scores like in ID Method

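Chunk Method (figure)
Inverted lists for golden and gate, ordered by chunk (chunks 12, 11, 10, … shown), with document Ids stored within each chunk and no scores in the lists
Short list
Score Table (Id, Score): (1, 70.85), …, (12, 90.19), …
ListScore Table (Id, Score, InShortList)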

Chunk Method: Details
Setting chunk boundaries
  – highdoc(c) = highest score of document in chunk c
  – For two successive chunks c1 and c2: highdoc(c1)/highdoc(c2) = r
      r = chunk ratio
Update document in short list only if document score exceeds 2 chunk boundaries
  – 2 chunks handles boundary cases
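
The boundary rule can be sketched as follows. This is a minimal illustration under stated assumptions: chunk upper bounds are taken to grow geometrically as `base * r**i` (the slide defines the ratio over the highest actual score in each chunk), chunks are numbered from the lowest scores upward, and the names `chunk_of`, `needs_list_update`, `base` are hypothetical.

```python
def chunk_of(score, base, r):
    """Chunk number under geometric upper bounds base, base*r, base*r**2, ...

    Chunk 0 holds scores <= base; chunk c holds scores in (base*r**(c-1), base*r**c].
    """
    chunk, high = 0, base
    while score > high:
        high *= r
        chunk += 1
    return chunk

def needs_list_update(list_chunk, new_score, base, r):
    """Move the document to the short list only if it crossed two chunk boundaries."""
    return chunk_of(new_score, base, r) >= list_chunk + 2

base, r = 10.0, 2.0                         # hypothetical base score and chunk ratio
list_chunk = chunk_of(90.19, base, r)       # chunk recorded when the list was built
print(list_chunk,
      needs_list_update(list_chunk, 170.0, base, r),
      needs_list_update(list_chunk, 400.0, base, r))
```

A larger chunk ratio makes chunks wider, so fewer score changes cross two boundaries and reach the inverted lists, while queries have to read more documents per chunk.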

Chunk-TermScore Method
Support combination of SVR and TF-IDF
Combines Chunk Method with Fancy-ID Method [Long and Suel]
  – In addition to long and short lists (ordered by chunk), have short fancy list (ordered by TF-IDF)
  – Combined merge of all three lists
Details in ICDE paper

Summary of Alternatives
ID Method
  – Efficient updates, slow queries
Score Method
  – Efficient queries, slow updates
Score-Threshold Method
  – Efficient updates, intermediate queries
Chunk Method
  – Efficient updates, efficient queries
Chunk-TermScore Method
  – Efficient updates, efficient queries, TF-IDF + SVR

Experimental Setup
Two primary performance metrics
  – Time for a score update
      Only time to update inverted lists
  – Time for a top-k query
Data sets
  – Real (Internet Archive): 60MB
      Thanks to Brewster Kahle and Jon Aizen
  – Synthetic: 805MB
Compared all five alternatives + ID-TermScore (baseline for Chunk-TermScore)

Implementation Details
Inverted lists implemented in BerkeleyDB
  – Long inverted lists as CLOBs
      Read in a page at a time during query processing
  – Short inverted lists as clustered B+ trees
      Since short inverted lists are updated
Query algorithms implemented in C++

Inverted List Size
  ID Method:               145MB
  Score Method:            2768MB
  Score-Threshold Method:  847MB
  Chunk Method:            146MB
  ID-TermScore Method:     428MB
  Chunk-TermScore Method:  430MB

Effect of Chunk Ratio (chart; times in milliseconds)

Varying # Updates (chart; times in milliseconds)

Varying k in Top-k (chart)

SVR + TF-IDF (chart; times in milliseconds)

Related Work
SQL/MM
  – Integrating keyword search with databases
BANKS, DBXplorer, DISCOVER
  – Search “across” tuples, but simple or traditional IR ranking
Top-k inverted lists and query processing
  – Do not handle score updates
Inverted list updates
  – Handle only content updates, not score updates
  – Proposed techniques can handle content updates too

10000-foot view of Data Management (diagram)
Axes: Data (Structured, Unstructured) and Queries (Complex and Structured, Ranked Keyword Search)
  – Database Systems: structured data, complex and structured queries
  – Information Retrieval Systems: unstructured data, ranked keyword search
Added in the build: Text search in databases; Ranking based on structured values

Towards Unifying DB and IR
XRank: Keyword search over semi-structured XML documents
  – Extends keyword search to work over both structured and unstructured data
  – SIGMOD 2003 [Guo et al.]
TeXQuery: Query language for structured and unstructured data, structured and keyword queries
  – Precursor to W3C XQuery Full-Text
  – WWW 2004 [Amer-Yahia et al.]

Questions?
