CAREER: Towards Unifying Database Systems and Information Retrieval Systems NSF IDM Workshop 10 Oct 2004 Jayavel Shanmugasundaram Cornell University.


1 CAREER: Towards Unifying Database Systems and Information Retrieval Systems
NSF IDM Workshop, 10 Oct 2004
Jayavel Shanmugasundaram, Cornell University

2 10000 foot view of Data Management
Data: Structured | Unstructured
Queries: Complex and Structured | Ranked Keyword Search
Database Systems: structured data, complex and structured queries
Information Retrieval Systems: unstructured data, ranked keyword search

3 10000 foot view of Data Management
Data: Structured | Unstructured
Queries: Complex and Structured | Ranked Keyword Search
Database Systems: structured data, complex and structured queries
Information Retrieval Systems: unstructured data, ranked keyword search
Text search in databases
Ranking based on structured values

4 Internet Archive Database
Movies
Mid   Name              Description
10    Amateur Film      … they stand on the golden gate bridge and …
20    American Thrift   … golden gate bridge with statue of liberty …
…     …                 …
SELECT * FROM Movies M ORDER BY score(M.description, "golden gate") FETCH TOP 10 RESULTS ONLY
Traditional IR scoring methods (e.g., TF*IDF) often not very meaningful in this context
–Developed for stand-alone document collections
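To make the concern on this slide concrete, here is a minimal sketch of a textbook-style TF*IDF score applied to the two description fragments visible on the slide. The whitespace tokenizer, the smoothed IDF variant log(1 + N/df), and the function name tf_idf_scores are assumptions made for illustration; the slide does not say which TF*IDF variant was considered.

```python
import math
from collections import Counter

# Description fragments as visible on the slide (the elided "…" parts are unknown).
descriptions = {
    10: "they stand on the golden gate bridge and",
    20: "golden gate bridge with statue of liberty",
}

def tf_idf_scores(docs, query):
    """Score each document for the query with a simple TF*IDF sum.

    Assumed variant: tf = term count / document length,
    idf = log(1 + N / df). Other variants are equally plausible.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    terms = query.lower().split()
    # Document frequency: number of documents containing each query term.
    df = {t: sum(1 for toks in tokenized.values() if t in toks) for t in terms}
    scores = {}
    for doc_id, toks in tokenized.items():
        counts = Counter(toks)
        score = 0.0
        for t in terms:
            if df[t] == 0 or not toks:
                continue
            score += (counts[t] / len(toks)) * math.log(1 + n_docs / df[t])
        scores[doc_id] = score
    return scores

scores = tf_idf_scores(descriptions, "golden gate")
```

On these fragments both movies match both keywords once, so the two scores differ only through description length. That is a weak signal for ranking movies, which is consistent with the slide's remark.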

5 Internet Archive Database
Movies
Mid   Name              Description
10    Amateur Film      … they stand on the golden gate bridge and …
20    American Thrift   … golden gate bridge with statue of liberty …
…     …                 …
Reviews
Rid   Mid   Name       Rating
901   10    bleblanc   2
902   20    cooker     4
903   10    harry      1
904   20    alice      5
…     …     …          …
Statistics
Sid   Mid   Visits   Downloads
81    10    285      90
82    20    927      247
…     …     …        …
Structured Value Ranking (SVR)
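As a rough sketch of what a structured-value score could look like, the snippet below combines each movie's average review rating with its visit and download counts. The rows are the Reviews and Statistics values as they appear to read on the slide. The linear combination, its weights, and the name svr_scores are purely hypothetical; the slides do not specify how the structured values are aggregated.

```python
from collections import defaultdict

# (Rid, Mid, Name, Rating), as the Reviews rows appear to read on the slide.
reviews = [
    (901, 10, "bleblanc", 2),
    (902, 20, "cooker", 4),
    (903, 10, "harry", 1),
    (904, 20, "alice", 5),
]
# (Sid, Mid, Visits, Downloads), as the Statistics rows appear to read on the slide.
statistics = [
    (81, 10, 285, 90),
    (82, 20, 927, 247),
]

def svr_scores(reviews, statistics, w_rating=100.0, w_visits=1.0, w_downloads=1.0):
    """Return {Mid: score} from structured values (hypothetical weighting)."""
    ratings = defaultdict(list)
    for _rid, mid, _name, rating in reviews:
        ratings[mid].append(rating)
    scores = {}
    # Only movies that have a Statistics row receive a score in this sketch.
    for _sid, mid, visits, downloads in statistics:
        movie_ratings = ratings[mid]
        avg_rating = sum(movie_ratings) / len(movie_ratings) if movie_ratings else 0.0
        scores[mid] = w_rating * avg_rating + w_visits * visits + w_downloads * downloads
    return scores

scores = svr_scores(reviews, statistics)
```

Under these assumed weights, movie 20 outranks movie 10 because it has better reviews and more traffic, even though both descriptions match the keywords "golden gate".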

6 Structured Value Ranking
Use structured data values associated with text columns to score results
Main technical challenge
–Need to produce top-k results efficiently
   Order inverted lists by score
–But scores change frequently [Aizen et al., 2004]
   Flash crowds on Internet
   Recent award announcements
–How can we process top-k results efficiently while allowing frequent score updates?

7 Solution Overview
Order inverted lists by score
–Queries efficient
–Score updates slow
Order inverted lists by document id
–Queries slow
–Score updates efficient
Hybrid solution: order inverted lists by chunk
–Order chunks by score
–Order documents within chunk by id
Guo et al. [ICDE 2005]
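The hybrid idea can be sketched as a single posting list split into score-range chunks. This is a simplified illustration, not the algorithm from Guo et al.: the fixed chunk boundaries, the class name ChunkedPostingList, and its methods are all assumptions. Chunks are ordered by score range, document ids are kept sorted within each chunk, and a score update touches the list only when the document crosses a chunk boundary.

```python
import bisect

class ChunkedPostingList:
    """One term's posting list, chunked by score range (illustrative sketch)."""

    def __init__(self, boundaries):
        # Ascending lower bounds of the chunks, e.g. [0, 100, 1000].
        self.boundaries = sorted(boundaries)
        self.chunks = [[] for _ in self.boundaries]  # sorted doc ids per chunk
        self.scores = {}  # doc id -> current score, stored outside the list

    def _chunk_of(self, score):
        # Index of the last boundary <= score; low scores fall into chunk 0.
        return max(bisect.bisect_right(self.boundaries, score) - 1, 0)

    def upsert(self, doc_id, score):
        new_chunk = self._chunk_of(score)
        old_score = self.scores.get(doc_id)
        if old_score is None:
            bisect.insort(self.chunks[new_chunk], doc_id)
        else:
            old_chunk = self._chunk_of(old_score)
            if old_chunk != new_chunk:
                # Only boundary-crossing updates modify the posting list.
                ids = self.chunks[old_chunk]
                ids.pop(bisect.bisect_left(ids, doc_id))
                bisect.insort(self.chunks[new_chunk], doc_id)
        self.scores[doc_id] = score

    def top_k(self, k):
        # Scan chunks from the highest score range down. Every document in a
        # lower chunk scores below every document already collected, so the
        # scan can stop once k candidates have been gathered.
        candidates = []
        for chunk in reversed(self.chunks):
            candidates.extend((doc_id, self.scores[doc_id]) for doc_id in chunk)
            if len(candidates) >= k:
                break
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
        return candidates[:k]

postings = ChunkedPostingList([0, 100, 1000])
postings.upsert(10, 525)
postings.upsert(20, 1624)
postings.upsert(30, 50)
best = postings.top_k(1)
```

The trade-off in this sketch is the chunk width: wide chunks absorb more score updates without list changes but force top_k to examine more documents per chunk, while narrow chunks do the opposite.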

8 10000 foot view of Data Management
Data: Structured | Unstructured
Queries: Complex and Structured | Ranked Keyword Search
Database Systems: structured data, complex and structured queries
Information Retrieval Systems: unstructured data, ranked keyword search

9 Applications
Content management
–Mix of structured and unstructured data
   Database with date and time of accident (structured data) and accident description (unstructured data)
–Semi-structured data
   Scientific documents, Shakespeare's plays, …
Support flexible keyword search interface over mix of structured and unstructured data
–XRANK [Guo et al., SIGMOD 2003]

10 XML Keyword Search
XML and Information Retrieval: A SIGIR 2000 Workshop
David Carmel, Yoelle Maarek, Aya Soffer
XQL and Proximal Nodes
Ricardo Baeza-Yates, Gonzalo Navarro
We consider the recently proposed language …
Searching on structured text is becoming more important with XML …
… …
Most specific results (exploits structure!)
Ranking at granularity of elements (generalizes PageRank)
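A simplified sketch of "most specific results": return the deepest elements whose subtree contains every query keyword. The tag names in the sample document are hypothetical (the slide shows only the text), the text is limited to the fragments visible on the slide, and the function most_specific is an illustration rather than XRANK's actual result definition or ranking.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup around the text fragments shown on the slide.
SAMPLE = """
<workshop>
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <paper>
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Gonzalo Navarro</author>
    <abstract>We consider the recently proposed language</abstract>
    <body>Searching on structured text is becoming more important with XML</body>
  </paper>
</workshop>
"""

def subtree_words(elem):
    """Lower-cased words in the element and all of its descendants."""
    return set(re.findall(r"\w+", "".join(elem.itertext()).lower()))

def most_specific(root, keywords):
    """Elements containing all keywords that have no child containing all of them."""
    wanted = {k.lower() for k in keywords}
    hits = []

    def visit(elem):
        if not wanted <= subtree_words(elem):
            return False
        # Visit every child (no short-circuit) so all deep matches are found.
        child_matches = [visit(child) for child in elem]
        if not any(child_matches):
            hits.append(elem)
        return True

    visit(root)
    return hits

root = ET.fromstring(SAMPLE)
result_tags = [e.tag for e in most_specific(root, ["XQL", "language"])]
```

For the keywords "XQL" and "language", the whole workshop element matches, but the paper element is the smallest element containing both, so only the paper is returned. Recomputing subtree_words at every level is quadratic in the worst case; it keeps the sketch short and is not meant as an efficient index-based evaluation.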

11 10000 foot view of Data Management
Data: Structured | Unstructured
Queries: Complex and Structured | Ranked Keyword Search
Database Systems: structured data, complex and structured queries
Information Retrieval Systems: unstructured data, ranked keyword search

12 Applications
The Internet is enabling end-users to directly ask queries and explore results
–E.g., used car marketplace
–Find all "bright red ford mustangs" that cost less than 20% of the average price of cars in their class
Characteristics of queries
–Keyword search (for ease of use)
–Complex query operations (information synthesis)
–Want to see ranked results!
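The used-car query mixes all three characteristics listed on this slide, which a small sketch can show: a keyword match on the description, an aggregate over each class of car, and a ranked result. The listings, field names, and the function find_bargains are invented for illustration, and ranking by price relative to the class average is just one assumed choice.

```python
import re
from collections import defaultdict

# Invented listings; none of this data comes from the talk.
cars = [
    {"id": 1, "cls": "sports", "price": 4000, "description": "bright red Ford Mustang, needs work"},
    {"id": 2, "cls": "sports", "price": 30000, "description": "bright red Ford Mustang, low mileage"},
    {"id": 3, "cls": "sports", "price": 50000, "description": "silver Chevrolet Corvette"},
    {"id": 4, "cls": "sports", "price": 36000, "description": "blue Ford Mustang convertible"},
    {"id": 5, "cls": "sedan", "price": 2000, "description": "bright red Ford sedan"},
]

def find_bargains(cars, keywords, fraction=0.2):
    """Ids of cars matching all keywords and priced below fraction * class average."""
    wanted = set(re.findall(r"\w+", keywords.lower()))
    # Complex query operation: average price per class.
    prices = defaultdict(list)
    for car in cars:
        prices[car["cls"]].append(car["price"])
    average = {cls: sum(p) / len(p) for cls, p in prices.items()}
    matches = []
    for car in cars:
        words = set(re.findall(r"\w+", car["description"].lower()))
        ratio = car["price"] / average[car["cls"]]
        # Keyword search (exact word match, no stemming) plus structured predicate.
        if wanted <= words and ratio < fraction:
            matches.append((ratio, car["id"]))
    # Ranked results: cheapest relative to its class first.
    return [car_id for _ratio, car_id in sorted(matches)]

bargains = find_bargains(cars, "bright red ford mustang")
```

In this made-up data the sports class averages 30000, so only listing 1 (priced at 4000) is both a keyword match and below 20% of its class average.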

13 Towards Unifying DB and IR
No standard query language for both DB and IR
–SQL, XQuery mostly "database query languages"
Have developed TeXQuery: a full-text search extension to XQuery
–Amer-Yahia et al. (WWW 2004)
–Full composability of database and IR primitives, ranking
–Adopted as the precursor to the XQuery full-text extensions currently being developed by the W3C
Come see demo tomorrow

14 Related Work
Integrating DB and IR systems
–For the most part, treat individual systems as "black boxes"
–Our goal is to unify DB and IR systems
Search over Semi-Structured Data
–Specialized techniques for searching semi-structured data
–Our goal is to generalize DB and IR techniques
Keyword search and ranking in databases

15 Summary
Many emerging applications require a unification of DB and IR techniques
–E-commerce applications
–Semi-structured documents
–Content management
Argues for a new generation of systems and techniques that seamlessly provide this capability
–SVR, XRANK, TeXQuery, …
Educational benefit: present unified view of data management
–Currently at graduate level
–Eventually introduce concepts at undergraduate level

