Enhancing Internet Search Engines to Achieve Concept-based Retrieval


1 Enhancing Internet Search Engines to Achieve Concept-based Retrieval
F. Lu, T. Johnsten, V. Raghavan, and D. Traylor

2 Agenda
Information on the Internet. Boolean Retrieval Model and the Internet. Personalized Search. Concept-Based Retrieval (RUBRIC / CS3). CS3 and Boolean Search Engines. Deep Web Sources. Current & Future Work.

3 Information on the Internet
Large volume. Rapid growth rate. Wide variations in quality and type.

4 Boolean Retrieval Model and the Internet
Most Internet search engines are based on the Boolean Retrieval Model. The Boolean Retrieval Model is relatively easy to implement. Limitations: Inability to assign weights to query or document terms. Inability to rank retrieved documents. Naïve users have difficulty using it.

5 Personalized Search User Query Personalized Results
Personalized Engine Query Processor User Profile General Profile Result Processor Query Augmentation Search Results Search Engine

6 Concept-Based Retrieval
Address shortcomings of Boolean Retrieval Model. Search Requests specified in terms of concepts structured as rule-base trees.

7 Development of Rule-Base Trees (General)
Top-down refinement strategy. Support for AND / OR relationships. Support for user-defined weights.

9 Development of Rule-Base Trees (CS3)
Concept-Set Structuring System (CS3). CS3 supports the creation, storage and modification of user-defined concepts. Post-processing of results of sub-queries. CS3 user-interface.

10 CS3 User Interface

11 Evaluation of Rule-Base Trees (RUBRIC)
Run-time, bottom-up analysis. Propagation of weight values (MIN / MAX). Disadvantage of run-time analysis.
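
The bottom-up MIN / MAX propagation can be illustrated with a minimal sketch. It assumes that a leaf scores 1.0 when its term occurs in the document, that AND takes the minimum and OR the maximum of the child values, and that each value is scaled by a user-defined edge weight before it is passed up. The `Node` class, the `evaluate` function and all weights are hypothetical and are not taken from RUBRIC itself.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """One node of a rule-base tree (hypothetical structure, for illustration)."""
    label: str
    op: str = "TERM"        # "AND", "OR", or "TERM" for a leaf
    weight: float = 1.0     # assumed user-defined weight on the edge to the parent
    children: List["Node"] = field(default_factory=list)


def evaluate(node: Node, doc_terms: Set[str]) -> float:
    """Evaluate the tree bottom-up against the terms of one document."""
    if node.op == "TERM":
        value = 1.0 if node.label in doc_terms else 0.0
    else:
        child_values = [evaluate(child, doc_terms) for child in node.children]
        # AND propagates the weakest child, OR the strongest one.
        value = min(child_values) if node.op == "AND" else max(child_values)
    return node.weight * value


# Illustrative tree; the weights 0.8 and 0.6 are made up.
tree = Node("hydrogeology concept", "OR", 1.0, [
    Node("hydrogeology"),
    Node("dnapl", weight=0.8),
    Node("colloid transport", "AND", 0.6, [
        Node("colloid"),
        Node("environmental transport"),
    ]),
])

print(evaluate(tree, {"colloid", "environmental transport"}))
print(evaluate(tree, {"dnapl", "colloid"}))
```

Because the whole tree is walked once per document at query time, the cost grows with both tree size and collection size, which is one way to read the disadvantage of run-time analysis noted on the slide.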

13 Evaluation of Rule-Base Trees (CS3)
Static, bottom-up analysis. Construct Minimal Term Set (MTS). Propagation of terms. CS3 user-interface.

14 MTS: Minimal Term Set
An MTS for a topic is a set of terms such that, if every term in the set appears in a document, the document receives a retrieval status value (RSV) larger than 0; otherwise the RSV is 0. A topic can have more than one MTS. A user can choose among those MTSs to run a search that suits their needs.
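
As a rough sketch of how minimal term sets could be derived statically from an AND / OR tree: a leaf contributes its own term, an OR node collects the sets of its children, and an AND node combines one set from each child. Any set that strictly contains another is then dropped. The nested-tuple tree encoding and the function name `minimal_term_sets` are assumptions made for this example, not the CS3 implementation.

```python
from itertools import product


def minimal_term_sets(node):
    """Return the minimal term sets of a rule-base tree.

    A leaf is a string; an inner node is a tuple ("AND" | "OR", [children]).
    """
    if isinstance(node, str):
        return {frozenset([node])}
    op, children = node
    child_sets = [minimal_term_sets(child) for child in children]
    if op == "OR":
        # Any single child's term set is enough to satisfy an OR node.
        candidates = set().union(*child_sets)
    else:
        # An AND node needs one term set from every child at once.
        candidates = {frozenset().union(*combo) for combo in product(*child_sets)}
    # Drop every set that has a proper subset among the candidates.
    return {s for s in candidates if not any(t < s for t in candidates)}


tree = ("OR", [
    "hydrogeology",
    "dnapl",
    ("AND", ["colloid", "environmental transport"]),
])

for term_set in sorted(minimal_term_sets(tree), key=sorted):
    print(sorted(term_set))
```

Since this analysis depends only on the tree and not on any document, it can be done once, before any search is issued.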

19 CS3 and Boolean Search Engines
CS3 is designed to interface with existing Boolean search engines. U.S. Department of Energy’s “Information-Bridge” search engine. U.S. Department of Transportation’s “National Transportation Library” search engine.

20 System Architecture Client (Java/ Applet ) CORBA CGI Server (JAVA)
Server (JAVA/C++) JDBC DOE InfoBridge etc. ORACLE

21 Information-Bridge and CS3
Search request: Boolean vs. concept. Output: non-ranked vs. ranked. Calculation of RSV: Given a document D and a set S of MTS expressions satisfied by D, the RSV of D is equal to the sum of all the weights of S plus the maximum weight in S.
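
The RSV rule can be read literally as the following small sketch. It assumes each MTS expression carries a single numeric weight and that a document satisfies an MTS when it contains all of its terms; the function name `rsv` and the weights 0.5 and 0.25 are hypothetical.

```python
def rsv(doc_terms, weighted_mts):
    """RSV as stated on the slide: sum of satisfied weights plus the maximum one.

    weighted_mts is a list of (term_set, weight) pairs (assumed representation).
    """
    doc_terms = set(doc_terms)
    satisfied = [weight for terms, weight in weighted_mts if set(terms) <= doc_terms]
    if not satisfied:
        return 0.0  # no MTS is satisfied, so the document is not retrieved
    return sum(satisfied) + max(satisfied)


# Illustrative weights only.
mts_list = [
    ({"hydrogeology"}, 0.5),
    ({"colloid", "environmental transport"}, 0.25),
]

print(rsv({"hydrogeology", "colloid", "environmental transport"}, mts_list))  # 1.25
print(rsv({"colloid", "environmental transport"}, mts_list))                  # 0.5
print(rsv({"colloid"}, mts_list))                                             # 0.0
```

Sorting the retrieved documents by this value in descending order yields the ranked output that a plain Boolean engine does not provide.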

22 Information-Bridge and CS3 (Example)
Boolean search request (“Environmental Science Network” Form): (“Hydrogeology” OR “Dnapl” OR (“Colloid*” AND “Environmental Transport”)). Concept (CS3): “Hydrogeology”. Rule-Base Tree.
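
One plausible way for a concept-based front end to talk to a Boolean engine is to turn each term set into an AND clause and join the clauses with OR. The sketch below does that; whether CS3 builds its sub-queries this way is an assumption, and `to_boolean_query` is a hypothetical helper.

```python
def to_boolean_query(term_sets):
    """Build an OR-of-ANDs Boolean query string from a list of term sets."""
    clauses = []
    for terms in term_sets:
        if not terms:
            continue  # an empty term set carries no constraint; skip it
        quoted = [f'"{term}"' for term in sorted(terms)]
        if len(quoted) == 1:
            clauses.append(quoted[0])
        else:
            clauses.append("(" + " AND ".join(quoted) + ")")
    return "(" + " OR ".join(clauses) + ")"


query = to_boolean_query([
    {"Hydrogeology"},
    {"Dnapl"},
    {"Colloid*", "Environmental Transport"},
])
print(query)
# ("Hydrogeology" OR "Dnapl" OR ("Colloid*" AND "Environmental Transport"))
```

With these three term sets the result has the same shape as the Boolean request shown on the slide.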

23 CS3 Hydrogeology Rule Base

24 CS3 search results

25 Deep Web Sources
Also referred to as the hidden Web or invisible Web. Resides behind search forms in databases, e.g. monster.com, louisiana1st.com, PubMed. Web pages in the deep Web are generated dynamically based on the submitted queries. Not indexed by current search engines. Search engines index content on the surface Web.

26 Deep Web Sources and Concept-based Retrieval
Deep Web in terms of size and quality: Size (Deep Web) = 500 * Size (Surface Web). Quality (Deep Web) = 1000 * Quality (Surface Web). Queries submitted to deep Web sources are more stable than queries submitted to search engines, so concept-based retrieval is naturally more suitable for deep Web sources.

27 Current and Future Work
Conduct experiments to evaluate effectiveness (future). Investigate alternative methods to compute RSVs [KADR00, KDR01*]. Learning edge weights through relevance feedback [KR00]. Thesaurus-based rule-base generation [KLR00].

28 Relevant URLs [LJRT99*] RaghavanHome  Publications since 1991

