SDLIP + STARTS = SDARTS A Protocol and Toolkit for Metasearching

1 SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching
Noah Green, Panagiotis G. Ipeirotis, Luis Gravano
Computer Science Dept., Columbia University

2 Web vs. “Hidden” Web
Web:
  Link structure
  Crawlable
Individual collections (or “Hidden” Web):
  No link structure
  Documents “hidden” behind search forms
1/14/2019 Columbia University Computer Science Dept.

3 Metasearching
Given many document sources and a query, a metasearcher:
  Finds the good sources for the query.
  Evaluates the query at these sources.
  Merges the results from these sources.
[Diagram labels: Metasearcher; Existing Web Application; Non-indexed Documents; Legacy Database / WAIS / etc.]
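
The three steps above can be sketched in Python as follows. This is a minimal illustration, not SDARTS code: the `Source` class, its in-memory document lists, the document-frequency rule for choosing sources, and the merge-by-score step are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical in-memory stand-in for a searchable collection."""
    name: str
    docs: list[str]

    def doc_freq(self, term: str) -> int:
        # Number of documents that contain the term (case-insensitive).
        return sum(term.lower() in d.lower().split() for d in self.docs)

    def search(self, term: str) -> list[tuple[float, str]]:
        # Score each matching document by how often the term occurs in it.
        hits = []
        for d in self.docs:
            count = d.lower().split().count(term.lower())
            if count:
                hits.append((float(count), d))
        return hits


def metasearch(sources, term, k=2):
    # Step 1: find the good sources (here: highest document frequency).
    ranked = sorted(sources, key=lambda s: s.doc_freq(term), reverse=True)
    chosen = [s for s in ranked[:k] if s.doc_freq(term) > 0]
    # Step 2: evaluate the query at each chosen source.
    results = [(score, s.name, doc) for s in chosen for score, doc in s.search(term)]
    # Step 3: merge the per-source results into one ranked list.
    return sorted(results, key=lambda r: r[0], reverse=True)


sources = [
    Source("medline", ["diabetes insulin study", "diabetes diabetes risk"]),
    Source("cs", ["xml parsing", "xml schema xml"]),
    Source("news", ["diabetes awareness week"]),
]
merged = metasearch(sources, "diabetes")
```

Real sources do not expose their documents like this, which is why the later slides rely on exported metadata for step 1 and on wrappers for step 2.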

4 Metasearching Issues
How to evaluate the relevance of different sources?
How to get metadata?
How to query different types of sources?
How to merge the results?
[Diagram: Metasearcher issuing source-specific queries: title=‘biomedical’&… ; SELECT title FROM articles . . . ; grep ‘biomedical’ *.txt]

5 Solution: A Common Protocol
[Diagram: Metasearcher connected to each source through S and M interfaces; legend: S = Search, M = Metadata; native operations behind the interfaces: grep, cat, select]

6 Why “SDARTS = SDLIP+STARTS”?
NOT yet another protocol
We combined existing efforts, keeping compatibility
SDLIP defines a common interface for interacting with the sources
STARTS defines expressive metadata that sources should export

7 SDARTS: Outline
Description of SDLIP.
Description of STARTS.
Integration of SDLIP and STARTS into SDARTS.
Implementation and configuration of SDARTS wrappers.

8 Simple Digital Library Interoperability Protocol
SDLIP = Simple Digital Library Interoperability Protocol
Developed during DLI2 project by:
  Stanford University
  UC Berkeley
  UC San Diego
  UC Santa Barbara
  San Diego Supercomputer Center
  California Digital Library

9 SDLIP: An Interoperability Protocol
Basic interfaces:
  Search
  Metadata
A wrapper implements these interfaces
Interface parameter and return types are XML
Transport layer implementations (HTTP, CORBA)
Flexible and adaptable
Optimized for clients that know the source to query (i.e., simple requirements for metadata)
[Diagram labels: Common SDLIP interface; S, M; DB-specific interfaces]

10 STARTS: Informal Standard for Search Engine Interoperability
Coordinated by Stanford in 1996
Both search engine vendors and "users" participated:
  Netscape
  Microsoft Network
  GILS
  Infoseek
  Harvest
  Hewlett-Packard
  Fulcrum
  Verity
  Wais
  PLS
  Excite

11 STARTS: A Metasearching Protocol
Defines:
  Query language
  Results format
  Metadata for the collection
No specified transport layer or implementation
Naturally complements SDLIP for metasearching purposes
Example of metadata:
  Stemming = no
  # of docs = 20,000
  Diabetes: TF = 12, DF = 4
  XML: TF = 1200, DF = 750
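
One way a metasearcher could use this kind of metadata is to compare the fraction of a collection's documents that contain each query term. The sketch below uses the figures from the example above. The dictionary layout and the scoring rule (document frequency divided by collection size) are illustrative assumptions; STARTS defines the metadata, not how to score with it.

```python
# Figures from the example metadata; the dictionary layout is an assumption.
summary = {
    "numdocs": 20_000,
    "terms": {
        "diabetes": {"tf": 12, "df": 4},
        "xml": {"tf": 1200, "df": 750},
    },
}


def coverage(summary, term):
    """Fraction of the collection's documents that contain `term`."""
    stats = summary["terms"].get(term.lower())
    if stats is None:
        return 0.0  # term not listed in the summary
    return stats["df"] / summary["numdocs"]


diabetes_cov = coverage(summary, "Diabetes")  # 4 / 20000
xml_cov = coverage(summary, "XML")            # 750 / 20000
```

With these numbers the collection looks like a far better target for a query on "XML" (3.75% of documents) than for one on "Diabetes" (0.02%).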

12 SDARTS = SDLIP + STARTS
Extends SDLIP with a richer metadata interface from STARTS
Keeps compatibility with SDLIP (same DTDs)
Can easily support similar protocols (transforming XML is easy)
Makes wrapping collections easy through a toolkit

13 SDARTS: Implementation Details
Defined STARTS using XML; new version named “STARTS XML.”
Used getPropertyInfo() from SDLIP to extend SDLIP with STARTS metadata.
Term frequency information is available through a different URL (faster download for metasearchers that do not use it).

14 Example of STARTS Metadata: “Content Summary”
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE starts:scontent-summary SYSTEM "
<starts:scontent-summary xmlns:starts="
    version="Starts 1.0"
    stemming="false"
    stopwords="false"
    case-sensitive="true"
    fields="false"
    numdocs="19997" >
  <starts:field-freq-info>
    …
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term>
      <starts:value>algorithm</starts:value>
    </starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
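
A client could read a content summary like this with a standard XML parser. The following Python sketch is illustrative only: the namespace URI `urn:example:starts` is a placeholder (the real URI is not shown in the transcript), the sample document is a completed, well-formed variant of the fragment above, and its exact nesting is an assumption.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace: the real URI is not shown in the slide.
NS = {"starts": "urn:example:starts"}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<starts:scontent-summary xmlns:starts="urn:example:starts"
    version="Starts 1.0" stemming="false" stopwords="false"
    case-sensitive="true" fields="false" numdocs="19997">
  <starts:field-freq-info>
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term><starts:value>algorithm</starts:value></starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
  </starts:field-freq-info>
</starts:scontent-summary>
"""


def parse_summary(xml_text):
    """Extract collection size and per-term statistics from a summary."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    terms = {}
    for info in root.findall("starts:field-freq-info", NS):
        value = info.findtext("starts:term/starts:value", namespaces=NS)
        terms[value] = {
            "tf": int(info.findtext("starts:term-freq", namespaces=NS)),
            "df": int(info.findtext("starts:doc-freq", namespaces=NS)),
        }
    return {"numdocs": int(root.get("numdocs")), "terms": terms}


parsed = parse_summary(SAMPLE)
```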

15 SDARTS Wrapper Design Rationale
Goal: Isolate developer from parsing and generating STARTS XML requests and responses
Goal: Reusability and simplicity
SDARTS toolkits and reference implementations:
  Wrapping local text document collections
  Wrapping XML collections
  Wrapping HTTP/CGI interfaces

16 SDARTS Wrapping Architecture
[Architecture diagram; labels as extracted: SDLIP LSP Client Program; STARTS XML over HTTP/DASL; LSPObjects; Internet; SDARTS Bean; BackEndLSP; FrontEnd LSP (S, M); Existing SDLIP Client; STARTS XML; Native Protocol / Search Engine]

17 SDARTS: Wrapper Implementation
Standardize on STARTS as the XML protocol for SDLIP
Create a standard wrapper architecture
“Front-End”:
  Implements SDLIP interfaces
  Communicates with client using STARTS XML nested inside SDLIP method calls
“Back-End”:
  Communicates with front-end using simple container objects
  Talks to underlying collection using native protocol
[Diagram labels: STARTS XML; FrontEnd LSP (S, M); LSPObjects; BackEnd LSP; Native Protocol / Search Engine]
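
The front-end/back-end split can be sketched as below. This is not the SDARTS toolkit (which is written in Java): every class name here is hypothetical, and the XML element names `query`, `results`, and `hit` are invented for the example rather than taken from STARTS XML. The point is only the division of labour: the front-end handles XML, the back-end sees plain container objects.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Query:
    """Simple container object passed from front-end to back-end."""
    field: str
    term: str


@dataclass
class Hit:
    """Simple container object returned by the back-end."""
    doc_id: str
    score: float


class DictBackEnd:
    """Hypothetical back-end whose 'native protocol' is a dict lookup."""

    def __init__(self, docs):
        self.docs = docs  # doc_id -> {field name: text}

    def search(self, query):
        hits = []
        for doc_id, fields in self.docs.items():
            text = fields.get(query.field, "").lower()
            count = text.count(query.term.lower())
            if count:
                hits.append(Hit(doc_id, float(count)))
        return hits


class FrontEnd:
    """Parses request XML, calls the back-end, serializes response XML."""

    def __init__(self, backend):
        self.backend = backend

    def search(self, request_xml):
        req = ET.fromstring(request_xml)
        query = Query(req.get("field"), req.get("term"))
        resp = ET.Element("results")
        for hit in self.backend.search(query):
            ET.SubElement(resp, "hit", id=hit.doc_id, score=str(hit.score))
        return ET.tostring(resp, encoding="unicode")


backend = DictBackEnd({
    "d1": {"title": "Biomedical imaging"},
    "d2": {"title": "Databases"},
})
response = FrontEnd(backend).search('<query field="title" term="biomedical"/>')
```

Under this split, wrapping a new collection means writing only another back-end class; the XML handling in the front-end is reused unchanged.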

18 Adding a Local Text Collection
Write standard doc_config.xml file
Regular expressions to describe where to find fields
No coding or compilation needed!
[Diagram labels: doc_config.xml; index; meta_attributes.xml; content_summary.xml; TextBackEndLSP; Lucene Search Engine; Non-indexed Text Documents]

19 Sample doc_config.xml
<doc-config re-index="true">
  <path>/home/dli2test/collections/doc1/20groups</path>
  <linkage-prefix>
  <stop-words><word>the</word> <word>a</word></stop-words>
  <field-descriptor name="author">
    <start><regexp>^From: </regexp></start>
    <end><regexp>$</regexp></end>
  </field-descriptor>
</doc-config>
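
To illustrate how a start/end regular-expression pair such as the "author" field descriptor could locate a field, here is a small Python sketch. It is an assumption about the matching semantics (first start match, then the next end match, with line-anchored patterns), not a description of how the SDARTS text wrapper applies them. The function name and the sample message are hypothetical.

```python
import re


def extract_field(text, start_pattern, end_pattern):
    """Return the text between the first start match and the next end match."""
    start = re.search(start_pattern, text, re.MULTILINE)
    if start is None:
        return None  # field not present in this document
    # Look for the end pattern only after the start match.
    end = re.compile(end_pattern, re.MULTILINE).search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.end():stop]


message = "From: alice@example.com (Alice)\nSubject: test\n\nBody text\n"
author = extract_field(message, r"^From: ", r"$")
```

With `re.MULTILINE`, `^` and `$` anchor to individual lines, so the pair above captures the remainder of the `From:` header line.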

20 Adding a Local XML Collection
Write standard doc_config.xml file
Write an XSL stylesheet to find fields in documents
No coding or compilation needed!
[Diagram labels: doc_style.xsl; index; meta_attributes.xml; content_summary.xml; doc_config.xml; Apache Xalan XSL Processor; Lucene Search Engine; XMLBackEndLSP; Non-indexed XML Documents]

21 Adding an External Web Collection
Must code a custom wrapper to send correct CGI parameters and parse returning HTML
No new code needed; uses XSLT for parsing the results
Usually no metadata or content summary available
Possible to automate metadata extraction:
  [Callan et al., SIGMOD’99]: Automatic extraction of vocabulary statistics
  [Ipeirotis et al., SIGMOD’01]: Automatic categorization of databases
  [Raghavan and Garcia-Molina, VLDB 2001]: Automatic interaction with forms
[Diagram labels: meta_attributes.xml; Web BackEnd LSP; HTTP/CGI Collection]

22 Conclusions
SDARTS uses SDLIP interfaces and code (compatible with it).
SDARTS enhances SDLIP and STARTS.
Reference wrappers available for common collection types.
Any text or XML document collection can be easily wrapped without new compiled code.
Automatic metadata extraction for local collections
Using XSLT for web wrappers
Possible to automate the extraction of rich metadata for web-accessible collections
New wrappers can be written without having to parse or generate STARTS XML.
SDARTS is in Java and can run on multiple platforms.

23 We are on the Web :)
http://sdarts.cs.columbia.edu/
Available for downloading:
  SDARTS DTDs and documentation
  Java code and search engine (Lucene) included
  Source code documentation
  Web client source code
  Reference wrappers (text, XML, web)
  Wrapped collections
The web client is web-accessible for the public to test and query our SDARTS server

24 Related Work
Metadata: Dublin Core, MARC
Interoperability Protocols: Z39.50, GILS
Open Archives

