OGF19 Grid Information Retrieval Working Group January 30, 2007 Chapel Hill, NC
2 Agenda
o IP Policy reminder
o Introduce participants
o GIR-WG charter & overview
o GIR document status review
o Reference implementations
o Mention of related work elsewhere
o Paul Kim presentation
o Chris Fallen presentation
o Discussion
3 Session Particulars
o OGF IP policies apply
o GIR-WG chairs:
  o Dr. Greg Newby, Arctic Region Supercomputing Center
  o Dr. Paul Yangwoo Kim, Dongguk U.
  o Nassib Nassar, RENCI
4 What is GIR-WG?
o GIR-WG was chartered by OGF to develop standards and reference implementations for information retrieval (IR) on computational grids.
o GIR-WG has published a Requirements document under GGF (GFD-I.027).
o Our first Experimental document was published recently (GFD-E.082).
o Progress on the Architecture document is dormant, awaiting practical experience.
o Practical experience is being gained, and will result in, at a minimum, further experimental documents.
5 What is Information Retrieval?
o IR is the science and method of delivering documents that are relevant to human information needs.
o Rather than delivering sets of matching documents (as a DBMS does), IR systems rank matching documents.
o IR systems usually focus on textual (natural language) input data, either unformatted or formatted (plain text, HTML, XML, etc.).
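The contrast between set matching and ranking can be sketched in a few lines. This is a toy illustration only: the sample documents, the function names, and the term-frequency scoring are assumptions made for the example, not something GIR-WG specifies.

```python
from collections import Counter

# Hypothetical three-document collection, for illustration only.
docs = {
    "d1": "grid computing for information retrieval",
    "d2": "information retrieval ranks documents by relevance",
    "d3": "database systems return matching sets",
}

def boolean_match(query, docs):
    """DBMS-style: return the unordered set of documents containing every query term."""
    terms = query.lower().split()
    return {doc_id for doc_id, text in docs.items()
            if all(t in text.lower().split() for t in terms)}

def ranked_search(query, docs):
    """IR-style: score each document by query-term frequency, best first."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

boolean_match answers "which documents match?", while ranked_search answers "which documents match best?"; a real IR engine would use a stronger weighting scheme than raw term counts.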
6 GIR-WG Charter
The GIR WG will establish a specific set of requirements, an architecture, and detailed specifications for Information Retrieval (IR) on computational grids. GIR will provide document collection management, indexing/searching, and query processing services to grid users and applications.
GIR Milestones:
o GIR Requirements Document - Stakeholder-driven list of service-level requirements for building a grid-based IR system. Published in 2005 as GFD-I.027.
o GIR Architecture Document - Describes the overall system, composed of integrated grid services, scenarios, etc. Draft under consideration since 2004; based on Experimental document outcomes, final version is expected in
o Experimental Documents - Experiences with GIR implementations or partial implementations (query processors, indexers, collection managers...). GFD-E.082 in 2006; others under consideration.
o GIR Recommendation Draft Document - Describes each service in detail, with sections for different implementation platforms (such as Web Services, Grid Services, standalone...). Draft is expected after Architecture document, in
o GIR Recommendation Final Document - After the Draft Recommendation, based on independent interoperable implementations and further practical experiences. Within 2 years of the Draft Recommendation.
7 Why IR is a good candidate for Grid computing
o Excellent for divide-and-conquer, coarse-grained parallelism
o Input items are discrete
o Coordination across subsets of a document collection can be minimal
o Results from multiple sources can be coordinated and relevance ranked together
o Queries may be handled independently
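A minimal sketch of the divide-and-conquer pattern: each partition of the collection is searched independently, and the per-partition rankings are merged into one list. The partition layout, function names, and term-count scoring are assumptions for illustration; threads stand in for grid nodes. Merging raw scores is only reasonable here because every partition uses the same scoring function.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
import heapq

def search_partition(partition, query):
    """Search one subset of the collection; no coordination with other subsets."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in partition.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            hits.append((score, doc_id))
    # Best score first; ties broken by document id.
    return sorted(hits, key=lambda h: (-h[0], h[1]))

def grid_search(partitions, query, top_k=3):
    """Fan the query out to every partition, then merge the ranked lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        per_partition = list(pool.map(lambda p: search_partition(p, query), partitions))
    # heapq.merge combines already-sorted lists without re-sorting everything.
    merged = heapq.merge(*per_partition, key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in islice(merged, top_k)]
```

For example, grid_search([{"a": "grid grid search"}, {"b": "grid"}], "grid") ranks "a" ahead of "b" even though the two documents live in different partitions.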
8 Significant Progress
o Documents:
  o GIR Requirements: published
  o GIR Architecture: in mid-draft (dormant)
  o Experimental document: published
o Implementation:
  o MCNC released a technology preview
  o Kim's work: an experimental document
  o Newby's work: heading to an experimental document
  o Nassar's work: Sarcomere & Amberfish, open source toolkit based on GT4
  o Fallen & Newby: distributed IR research
9 Requirements overview (per GFD-I.027)
Desirability of Grid infrastructure for IR, notably enterprise IR:
o VO (for security, segmentation)
o Conceptual separation of functions (for indexing, collection management & query processing)
o Flexible but coarse-grained flow of control among elements
o Persistence of queries, collections and indexes
Three primary components:
o Collection manager: handles input gathering, transformation, transport, staging and delivery
o Indexer: builds the core IR representation of the collection
o Query processor: responds to user needs, including standing information needs (i.e., information filtering)
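The separation into three components can be sketched as three small classes. Everything here (class names, method names, the inverted-index layout, the overlap test used for filtering) is a hypothetical illustration of the separation of functions, not the interface defined in GFD-I.027.

```python
class CollectionManager:
    """Gathers and transforms input documents (here: whitespace and case normalization)."""
    def __init__(self):
        self.documents = {}

    def add(self, doc_id, raw_text):
        self.documents[doc_id] = " ".join(raw_text.lower().split())
        return self.documents[doc_id]

class Indexer:
    """Builds the collection representation: term -> {doc_id: term frequency}."""
    def __init__(self):
        self.postings = {}

    def index(self, doc_id, text):
        for term in text.split():
            by_doc = self.postings.setdefault(term, {})
            by_doc[doc_id] = by_doc.get(doc_id, 0) + 1

class QueryProcessor:
    """Answers ad hoc queries and keeps standing queries for information filtering."""
    def __init__(self, indexer):
        self.indexer = indexer
        self.standing_queries = {}  # persisted query name -> query text

    def search(self, query):
        scores = {}
        for term in query.lower().split():
            for doc_id, tf in self.indexer.postings.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0) + tf
        return sorted(scores, key=lambda d: (-scores[d], d))

    def register_standing_query(self, name, query):
        self.standing_queries[name] = query

    def filter_new_document(self, text):
        """Return the names of standing queries that share a term with a new document."""
        words = set(text.split())
        return sorted(name for name, q in self.standing_queries.items()
                      if words & set(q.lower().split()))
```

Because each class only touches its own state plus the output of the previous stage, the three could run on different nodes with coarse-grained hand-offs between them.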
10 Implementation Approaches
o Do not rely on particular implementations or middleware (e.g., Globus)
o Pursue different types of Grid implementations:
  o Minimalist, home grown
  o Globus-based
  o Pure Web services
o These approaches can each be separate Experimental docs; they will become appendices in the Architecture doc
11 GFD-E.082
Kim: Grid Information Retrieval System for Dynamically Reconfigurable Virtual Organization
o Practical experience on re-allocation of GIR nodes based on system load
o A node serves as indexer, collection manager or query processor, depending on system load
o Dynamic reallocation of nodes within a computational grid
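One way to picture load-based reallocation is a single rebalancing step that moves a node from the least-loaded role to the most-loaded one. This is a hypothetical sketch: the load measure (queued tasks per node), the role names, and the one-node-at-a-time policy are assumptions and are not taken from GFD-E.082.

```python
ROLES = ("collection_manager", "indexer", "query_processor")

def reallocate(nodes, pending_work):
    """Return a new node -> role mapping after at most one reassignment.

    nodes: dict mapping node name to its current role.
    pending_work: dict mapping role to its number of queued tasks.
    """
    counts = {r: sum(1 for role in nodes.values() if role == r) for r in ROLES}

    def load(role):
        # Queued tasks per node; a role with work but no nodes is treated as overloaded.
        if counts[role] == 0:
            return float("inf") if pending_work.get(role, 0) > 0 else 0.0
        return pending_work.get(role, 0) / counts[role]

    busiest = max(ROLES, key=load)
    # Only roles with a spare node may donate, so no role is left empty.
    donors = [r for r in ROLES if r != busiest and counts[r] > 1]
    if not donors:
        return dict(nodes)
    idlest = min(donors, key=load)
    if load(idlest) >= load(busiest):
        return dict(nodes)  # already balanced
    new_nodes = dict(nodes)
    donor_node = sorted(n for n, r in nodes.items() if r == idlest)[0]
    new_nodes[donor_node] = busiest
    return new_nodes
```

Calling reallocate repeatedly as the queues change would let the node mix drift toward whichever function is currently the bottleneck.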
12 Nassar: Sarcomere
See
o Sarcomere calls a collection of documents a "database".
o One or more "indexes" can be created per database. Each index represents an access point for searching the document collection.
o In theory, indexes can differ in how they constrain the queries (e.g. by fields), what kind of data structures are used, etc. At the moment only Amberfish full text indexes are supported (index type = "Amberfish").
o Current port types (very rudimentary and highly subject to change):
  o createDatabase
  o deleteDatabase
  o createIndex
  o deleteIndex
  o addDocument
  o Search
Stay tuned for more developments!
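To make the database/index vocabulary concrete, here is an in-memory stand-in that mirrors the six port type names. It is not Sarcomere's API: the class, the argument lists, and the return values are all assumptions, and the "Amberfish" index type is only a label here (no Amberfish code is invoked).

```python
class ToyGirService:
    """Hypothetical in-memory analogue of the six port types listed above."""

    def __init__(self):
        # database name -> {"docs": {doc_id: text}, "indexes": {index_name: index_type}}
        self.databases = {}

    def create_database(self, name):            # cf. createDatabase
        if name in self.databases:
            raise ValueError(f"database {name!r} already exists")
        self.databases[name] = {"docs": {}, "indexes": {}}

    def delete_database(self, name):            # cf. deleteDatabase
        del self.databases[name]

    def create_index(self, database, index_name, index_type="Amberfish"):  # cf. createIndex
        self.databases[database]["indexes"][index_name] = index_type

    def delete_index(self, database, index_name):  # cf. deleteIndex
        del self.databases[database]["indexes"][index_name]

    def add_document(self, database, doc_id, text):  # cf. addDocument
        self.databases[database]["docs"][doc_id] = text

    def search(self, database, index_name, query):   # cf. Search
        """Search through one index, i.e. one access point to the database."""
        db = self.databases[database]
        if index_name not in db["indexes"]:
            raise KeyError(index_name)
        terms = query.lower().split()
        return sorted(doc_id for doc_id, text in db["docs"].items()
                      if all(t in text.lower().split() for t in terms))
```

The point of the sketch is the shape of the model: documents belong to a database, and every search goes through a named index on that database.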
13 Newby: Multisearch
o How can we merge result sets from different IR engines?
o Desire to merge based on global relevance
o Challenging because different IR engines have different scoring/ranking algorithms
o Challenging because different collections have different characteristics, influencing ranking
o Used for TREC by Fallen & Newby, 2005 and 2006
14
o Results are merged based on statistical normalization
o No accounting for different IR engines or different collections
o Simplifying assumption: all IR rankings come from the same basic distribution
o Simple interface to an Axis/Tomcat backend
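A minimal sketch of merging by statistical normalization, assuming z-score normalization within each engine's result list. The slides do not say which normalization Multisearch uses, so the z-score choice, the function names, and the (doc_id, score) input format are all assumptions.

```python
from statistics import mean, pstdev

def normalize(results):
    """Convert one engine's raw scores to z-scores within that result list."""
    scores = [score for _, score in results]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: every document sits exactly at the mean.
        return [(doc_id, 0.0) for doc_id, _ in results]
    return [(doc_id, (score - mu) / sigma) for doc_id, score in results]

def merge_results(result_sets):
    """Merge (doc_id, score) lists from several engines into one ranking.

    Every list is treated as a sample from the same basic distribution,
    with no adjustment for the engine or the collection it came from.
    """
    merged = []
    for results in result_sets:
        if results:
            merged.extend(normalize(results))
    # Highest normalized score first; ties broken by document id.
    return sorted(merged, key=lambda pair: (-pair[1], pair[0]))
```

After normalization, an engine that scores on a 0 to 1 scale and one that scores on a 0 to 100 scale contribute comparably to the merged ranking, which is exactly what the simplifying assumption buys.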
15 Opportunities for Interaction
o OGSA-DAI has middleware that provides basic query and result set transport
o Search from multiple databases; add a higher-level merger
o Seems promising for GIR!
16 Discussion of GIR-WG Your questions, thoughts and suggestions
17 Get Involved!
o Visit
o Subscribe to
o Talk with chairs about data and reference implementations and documents