Search by strategy Arjen P. de Vries CWI, Spinque, Delft University of Technology.

Slides:



Advertisements
Similar presentations
Information System (IS) Stakeholders
Advertisements

GMD German National Research Center for Information Technology Darmstadt University of Technology Perspectives and Priorities for Digital Libraries Research.
Database Planning, Design, and Administration
Ch:8 Design Concepts S.W Design should have following quality attribute: Functionality Usability Reliability Performance Supportability (extensibility,
Chapter 5: Introduction to Information Retrieval
25 February 2009Instructor: Tasneem Darwish1 University of Palestine Faculty of Applied Engineering and Urban Planning Software Engineering Department.
12- Copyright © 2010 Pearson Education, Inc. Publishing as Prentice Hall 1 Organizational Theory, Design, and Change Sixth Edition Gareth R. Jones Chapter.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
Web- and Multimedia-based Information Systems. Assessment Presentation Programming Assignment.
Fakultät für Informatik Technische Universität München A Quantitative Perspective on Systems of Systems Formerly: Upscaling for Systems of Systems Astrid,
Introduction To System Analysis and Design
Chapter 2 Data Models Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
Chapter 6 Database Design
Advanced Topics COMP163: Database Management Systems University of the Pacific December 9, 2008.
WebMiningResearch ASurvey Web Mining Research: A Survey By Raymond Kosala & Hendrik Blockeel, Katholieke Universitat Leuven, July 2000 Presented 4/18/2002.
Review 4 Chapters 8, 9, 10.
6 Chapter 6 Database Design Hachim Haddouti. 6 2 Hachim Haddouti and Rob & Coronel, Ch6 In this chapter, you will learn: That successful database design.
Final Year Project LYU0301 Location-Based Services Using GSM Cell Information over Symbian OS Mok Ming Fai CEG Lee Kwok Chau CEG.
Chapter 5: Information Retrieval and Web Search
Basic Concepts The Unified Modeling Language (UML) SYSC System Analysis and Design.
MDC Open Information Model West Virginia University CS486 Presentation Feb 18, 2000 Lijian Liu (OIM:
Introduction to Database Concepts
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Database System Development Lifecycle © Pearson Education Limited 1995, 2005.
Overview of the Database Development Process
1 DATABASE TECHNOLOGIES BUS Abdou Illia, Fall 2007 (Week 3, Tuesday 9/4/2007)
Ihr Logo Data Explorer - A data profiling tool. Your Logo Agenda  Introduction  Existing System  Limitations of Existing System  Proposed Solution.
Katanosh Morovat.   This concept is a formal approach for identifying the rules that encapsulate the structure, constraint, and control of the operation.
DBS201: DBA/DBMS Lecture 13.
Interacting With Data Databases.
Chapter 2 The process Process, Methods, and Tools
1 WEB Engineering Introduction to Electronic Commerce COMM1Q.
SWE 316: Software Design and Architecture – Dr. Khalid Aljasser Objectives Lecture 11 : Frameworks SWE 316: Software Design and Architecture  To understand.
COMP 410 & Sky.NET May 2 nd, What is COMP 410? Forming an independent company The customer The planning Learning teamwork.
Databases ? 2014, Fall Pusan National University Ki-Joune Li.
IST 210 Database Design Process IST 210 Todd S. Bacastow January 2005.
Week 4 Lecture Part 3 of 3 Database Design Samuel ConnSamuel Conn, Faculty Suggestions for using the Lecture Slides.
9/14/2012ISC329 Isabelle Bichindaritz1 Database System Life Cycle.
Introduction CS 3358 Data Structures. What is Computer Science? Computer Science is the study of algorithms, including their  Formal and mathematical.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
DBSQL 14-1 Copyright © Genetic Computer School 2009 Chapter 14 Microsoft SQL Server.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Introduction To System Analysis and Design
1/26/2004TCSS545A Isabelle Bichindaritz1 Database Management Systems Design Methodology.
Chapter 3 DECISION SUPPORT SYSTEMS CONCEPTS, METHODOLOGIES, AND TECHNOLOGIES: AN OVERVIEW Study sub-sections: , 3.12(p )
Greg Janée chit-chat with CS database folks 10/26/01 Gazetteer database 4.5 million items, each having: –1+ names fair to good discriminator –1 geospatial.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
Introduction CS 3358 Data Structures. What is Computer Science? Computer Science is the study of algorithms, including their  Formal and mathematical.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Tuning Before Feedback: Combining Ranking Discovery and Blind Feedback for Robust Retrieval* Weiguo Fan, Ming Luo, Li Wang, Wensi Xi, and Edward A. Fox.
The Structure of Information Retrieval Systems LBSC 708A/CMSC 838L Douglas W. Oard and Philip Resnik Session 1: September 4, 2001.
The Software Development Process
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
1 Information Retrieval LECTURE 1 : Introduction.
Information Retrieval CSE 8337 Spring 2007 Introduction/Overview Some Material for these slides obtained from: Modern Information Retrieval by Ricardo.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture Probabilistic Information Retrieval.
Concept mining for programming automation. Problem ➲ A lot of trivial tasks that could be automated – Add field Patronim on Customer page. – Remove field.
Automating programming via concept mining, probabilistic reasoning over semantic knowledge base of SE domain by Max Talanov.
Copyright 2007, Information Builders. Slide 1 iWay Web Services and WebFOCUS Consumption Michael Florkowski Information Builders.
User Errors in Formulating Queries and IR Techniques to Overcome Them Birger Larsen Information Interaction and Information Architecture Royal School of.
IST 210 Database Design Process IST 210, Section 1 Todd S. Bacastow January 2004.
Datab ase Systems Week 1 by Zohaib Jan.
Chapter 6 Database Design
موضوع پروژه : بازیابی اطلاعات Information Retrieval
Probabilistic Databases
Search Engine Architecture
Presentation transcript:

Search by strategy Arjen P. de Vries CWI, Spinque, Delft University of Technology

Interactive Information Access  Feedback:  Interaction improves information representation  Faceted Browsing:  Interaction can let user take over where machine would fail  Search by Strategy:  Interaction can let user take over where system designer would fail

Two extreme views on ‘search’ What is DB?  From business applications  Deductive reasoning  Precise and efficient query processing  Users with technical skills (SQL, XQuery, etc) and precise information needs Selection Books where category=‘CS’ What is IR?  From digital libraries, patent collections, etc  Inductive reasoning  Best-effort processing  Users with low technical skills and imprecise information needs Ranking Books about CS Note: SemWeb more DB than IR!!!

Complex search tasks (or, “not-so-simple” search tasks, if you like)  I want to buy a house in Amsterdam and I want it with ‘sfeer’ but still in good shape  I can afford about €350K. I need 3 bedrooms, the size should be about 80m 2. It should have a balcony or a backyard  The closer to the station and an AH, the better. BUT… I do not want to live in Amsterdam- Noord, unless there is a quick bus connection to the ferry  I may be willing to drop some of these constraints, but I’m not sure which

Search = IR + DB  Complex search tasks mix exact (DB) and ranked (IR) searches, on structured (DB) and unstructured (IR) data  Current technical solutions support either/or  SQL or Excel for structured data (DB)  Google or Lucene for unstructured data (IR)  Combining results requires significant effort  copy & paste result sets between interfaces, “human (probabilistic) joins”

Search = IR on-top-of DB ?  IR on-top-of DB: let exact and ranked operations both be processed by the same engine, so they can be mixed freely  IR responsible for ranking models, using DB as a data-access layer; no physical details necessary  DB responsible for reliable, dynamically optimised, data access; no logical details necessary

IR on-top-of DB???  Traditional, general-purpose DB technology cannot compete with custom IR search tools  Working assumption: using column stores should solve the efficiency problem

IR on-top-of DB – part I let $c := doc("papers.xml")//DOC[author = "John Smith"] let $q := "//text[about(.//Abstract, IR DB integration)]" for $res in tijah-query($c, $q) return $res/title/text() + + I am an IR and DB expert Hiemstra, Rode, Van Os, Flokstra, OSIR 2006 PF/Tijah: text search in an XML database system

IR on-top-of DB DATA AND QUERIES DATA MANAGEMENT Bla bla Bla bla bla, bla bla. Bla. Bla bla Bla bla bla, bla bla. Bla, bla bla Bla bla ENGINEERING Data Model Mismatch!!! To implement IR tasks on top of DB is a tough job! FORMALISM MAPPING Rölleke, Tsikrika, Kazai A general matrix framework for modelling Information Retrieval “IR concepts to matrix spaces” Cornacchia, De Vries, Van Ballegooij, Kersten SRAM (Sparse Relational Array Mapping) “matrix spaces to relational tables”

SRAM – query lifecycle

IR on-top-of DB – part II let $c := doc("papers.xml")//DOC[author = "John Smith"] let $q := "//text[about(.//Abstract, IR DB integration)]" for $res in tijah-query($c, $q) return $res/title/text() +++ I am an IR expert (needn’t be a DB expert as well) Cornacchia, De Vries, ECIR 2007 A Parametrised Search System

SRAM for IR # Language Modelling langmod(Q,d,λ) = sum( [ lm_t(d,t,λ) * Q(t) | t ] ) lm_t(d,t,λ) = log( λ * DTf(d,t) + (1-λ) * Tf(t) ) # Okapi BM25 bm25(Q,d,k1,b) = sum( [ w_t(d,t,k1,b) * Q(t) | t ] ) w_t(d,t,k1,b) = Tidf(t) * (k1+1) * DTf(d,t) / ( DTf(d,t) + k1 * ((1-b) + b * Dnl(d)/avgDl) ) DTnl = mxMult(mxTrnsp(LT), LD) # doc-term freq. Dnl = mvMult(mxTrnsp(LD), L) # doc lenght Tnl = mvMult(mxTrnsp(LT), L) # term freq. DTf = [ DTnl(d,t)/Dnl(d) | d,t ] # doc-term norm. freq. Tf = [ Tnl(t)/nLocs | t ] # term norm. freq. mxTrnsp(A) = [ A(j,i) | i,j ] mxMult(A,B) = [ sum([ A(i,k) * B(k,j) | k ]) | i,j ] mvMult(A,V) = [ sum([ A(i,j) * V(j) | j ]) | i ] Indexing.ram (excerpt) Linear_Algebra.ram (excerpt) Retrieval_Models.ram (excerpt)

Parameterised Search System (PSS) Cannot we ‘remove’ this IR engineer from the loop, like DBMS software removes the data engineer from the loop? Cornacchia, De Vries, ECIR 2007 A Parametrised Search System

Search by Strategy  Visually construct search strategies by connecting building blocks

Strategy Builder

From Patent to Inventor

Search by Strategy  Visually construct search strategies by connecting building blocks  Each block describes either data or actions upon that data  A search strategy may include multiple data sources  Data sources are internally represented as quadruples, triples extended with an additional probability value  Actions are actually scripts expressed in Fuhr and Roelleke’s PRA (TOIS 1997)

1. Which universities/colleges hold patents? 2. Who are the inventors named in those patents? 3. Which inventors are active in the area of our company? Real-life patent search example: Which researchers associated to universities and colleges should our Human Resources manager know to hire the right people on time?

Generate Search Engine

Work In Progress…

How Strategies Help  Strategies improve communication between search intermediary and user  Encapsulate domain expert knowledge  Abstract representation of search expert knowledge  Analyze information seeking process at any stage  Strategies facilitate knowledge management  Store / share / publish / refine  Strategies mix exact (DB) and ranked (IR) searches  Avoid the need for “human (probabilistic) joins”

Implementation  PRA translates into SQL (!)  Current system setup using CWI’s MonetDB column-store  Strategies are dynamically transformed into a REST API

Conclusion  Hand over control to the user (or, most likely, the search intermediary)  Patent information specialists  Digital forensics detectives  Librarians / archivists  Real estate agents  Travel agency

Open Issues  Assist the user make the best out of their increased level of control  Integrate usage data live system to help improve or adapt strategies  Handle “even larger” scale data  Patent demo fine on ~17GB semi-structured data (i.e., Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies  Close the loop!

Current Situation  index ;  repeat {  specify ;  retrieve  } until IR + DB IE

Desirable Situation  repeat {  index ;  specify ;  retrieve  } until IR + DB + IE