Alternatives to Federated Search - Presented by: Marc Krellenstein Date: July 29, 2005.



| 2 Why did we ever build federated search?
No one search service or database had all relevant info, or ever could have
It was too hard to know what databases to search
Even if you knew which dbs to search, it was too inconvenient to search them all
Learning one simple interface was easier than learning many complex ones

| 3 Do we still need federated search? No

| 4 No one service or db has all relevant info?
Databases have grown bigger than ever imagined
Google: 8B documents; Google Scholar: 400M+?
Scirus: 200M
Web of Knowledge (Humanities, Social Sci, Science): 28M
Scopus: 27M
Pubmed: 14M
Why? Cheaper and larger hard disks
Faster hardware, better software
World-wide network availability…no need to duplicate

| 5 No one service or db has all relevant info?
No maximum size in sight
A good thing, because content continues to grow
The simplest technical model for search: databases are logically single and central
…but physically multiple and internally distributed (Google has ~160,000 servers)
The simplest user model for search
The catch (but even worse for federated search): get the data, keep search quality high

| 6 It's hard to know what services to search?
Google/Google Scholar plus 1-2 vertical search tools
Pubmed, Compendex, WoK, PsycINFO, Scopus, etc.
For casual searches: Google alone is usually enough
Specialized smaller dbs where needed
Known to researcher or librarian, or available from a list
Ask a life science researcher what they use: "All I need is Google and Pubmed"

| 7 It's hard to know what services to search?
Alerts, RSS, etc. eliminate some searches altogether
Still…more than one search/source…but must balance inconvenience against the costs of federated search:
Will still need to do multiple searches…federated not enough
Least-common-denominator search – few advanced features
» Users are increasingly sophisticated
Duplicates
Slower response time
Broken connectors
The feeling that you're missing stuff…

| 8 One interface is easier to learn than many?
Yes…studies suggest users like a common interface (if not a common search service)
BUT Google has demonstrated the benefits of simplicity
More products are adopting simple, similar interfaces
There is still too much proprietary syntax – though advanced features and innovation justify some of it

| 9 So what are today's search challenges?
Getting the data for centralized and large vertical search services
Keeping search quality high for these large databases
Answering hard search questions

| 10 Getting the data for centralized services
Crawl it if it's free
…or make or buy it
Expensive, but usually worth the cost
Should still be cheaper for customers than many services
…or index multiple, maybe geographically separate databases with a single search engine that supports distributed search

| 11 Distributed (local/remote) search
Use a common metadata scheme (e.g., Dublin Core)
Search engine provides parallel search, integrated ranking/results
Google, Fast and Lucene already work this way, even for a single database
The separate databases can be maintained/updated separately
Results are truly integrated…as if it's one search engine
One query syntax, advanced capabilities, no duplicates, fast
Still requires a common technology platform
Federated search standards may someday approximate this
Standard syntax, results metadata…ranking? Amazon's A9?
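The fan-out-and-merge idea behind distributed search can be sketched in a few lines of Python. This is a minimal illustration, not any real engine's API: the shard contents and the crude term-count scoring are invented for the example, and the key assumption (as on the slide) is that all shards score documents on a comparable scale so one integrated ranking is possible.

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query):
    """One shard scores its own documents; scores must be comparable across shards."""
    return [(doc_id, text.lower().count(query))
            for doc_id, text in shard.items()
            if query in text.lower()]

def distributed_search(shards, query, top_k=5):
    """Fan the query out to all shards in parallel, then merge into one ranked list."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: search_shard(s, query), shards)
    merged = [hit for part in partials for hit in part]
    merged.sort(key=lambda h: h[1], reverse=True)  # integrated ranking across shards
    return merged[:top_k]

# Two separately maintained "databases" sharing one metadata/scoring scheme
shard_a = {"doc1": "federated search limits", "doc2": "search engine design"}
shard_b = {"doc3": "distributed search search search", "doc4": "metadata schemes"}
results = distributed_search([shard_a, shard_b], "search")
```

To the user this behaves like one search engine: a single query syntax, one merged result list, no duplicates, even though each shard is maintained independently.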

| 12 Keeping search quality high in big dbs
Can interpret keyword, Boolean and pseudo-natural-language queries
Spell checking, thesauri and stemming to improve recall (and sometimes precision)
Get lots of hits in a big db, but that's usually OK if there are good ones on top

| 13 Keeping search quality high in big dbs
Current best-practice relevancy ranking is pretty good:
Term frequency (TF): more hits count more
Inverse document frequency (IDF): hits of rarer search terms count more
Hits of search terms near each other count more
Hits on metadata count more
» Use anchor text – referring text – as metadata
Items with more links/references to them count more
» Authoritative links/referrers count yet more
Many other factors: length, date, etc.
Sophisticated ranking is a weak point for federated search
Google's genius: emphasize popularity to eliminate junk from the first pages (even if you don't always serve the best)
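The first two factors on the slide, TF and IDF, combine into the classic TF-IDF score. A minimal sketch (the tiny corpus is invented for illustration; real engines add the proximity, metadata and link factors listed above):

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc, corpus):
    """Score one document: term frequency (more hits count more) times
    inverse document frequency (hits of rarer terms count more)."""
    tf = Counter(doc.lower().split())
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d.lower().split())
        if df == 0:
            continue  # term appears nowhere; contributes nothing
        idf = math.log(n_docs / df)
        score += tf[term] * idf
    return score

corpus = [
    "seasonal affective disorder treatment",
    "economic depression and recovery",
    "depression depression symptoms and treatment",
]
scores = [tf_idf_score(["depression", "symptoms"], d, corpus) for d in corpus]
best = scores.index(max(scores))  # third doc: two TF hits plus the rare term
```

Note why this is a weak point for federated search: IDF needs corpus-wide document frequencies, which a federated broker querying remote sources never sees.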

| 14 But search challenges remain
Finding the best (not just good) documents
Popularity may not turn up the best, most recent, etc.
Answering hard questions
Hard to match multiple criteria
» "Find an experimental method like this one"
Hard to get answers to complex questions
» "What precursors were common to World War I and World War II?"
Summarize, uncover relationships, analyze
Long-term: understand any question…
None of the above is helped by least-common-denominator federated search

| 15 Finding the best
Don't rely too much on popularity
Even then, relevancy ranking has its limits
"I need information on depression"
"OK…here are 2,352 articles and 87 books"
Need a dialog…what kind of depression?
…psychological…what about it?
Underlying problem: most searches are under-specified

| 16 One solution: clustering documents
Group results around common themes: same author, web site, journal, subject…
Blurt out the largest/most interesting categories: the inarticulate librarian model
Depression: psychology, economics, meteorology, antiques…
Psychology: treatment of depression, depression symptoms, seasonal affective…
Psychology: Kocsis, J. (10), Berg, R. (8), …
Themes could come from static metadata or dynamically by analysis of results text
Static: fixed, clear categories and assignments
Dynamic: doesn't require metadata/taxonomy
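The static-metadata variant of this is simple to sketch: group the full result set by a facet and surface the largest themes first. The result records below are invented to mirror the slide's "depression" example; a dynamic clusterer would derive the themes from the result text instead of a `subject` field.

```python
from collections import Counter

def cluster_results(results, facet):
    """Group search results by a metadata facet and return the
    largest categories first (the 'inarticulate librarian' model)."""
    counts = Counter(doc[facet] for doc in results)
    return counts.most_common()

# Illustrative results for the under-specified query "depression"
results = [
    {"title": "Treating major depression", "subject": "psychology"},
    {"title": "SAD and light therapy",     "subject": "psychology"},
    {"title": "The Great Depression",      "subject": "economics"},
    {"title": "Surface depressions",       "subject": "meteorology"},
    {"title": "CBT outcomes",              "subject": "psychology"},
]
clusters = cluster_results(results, "subject")  # psychology first, with 3 hits
```

Because the counts come from the whole result list, this only works when the engine can see all hits, which is exactly why it fails with federated search that retrieves a few results per source.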

| 17 Clustering benefits
Disambiguates and refines search results to get to documents of interest quickly
Can navigate long result lists hierarchically
Would never offer thousands of choices to choose from as input…
Access to the bottom of the list…maybe just less common
Won't work with federated search that retrieves limited results from each source
Discovery – new aspects or sources
Can narrow results *after* search
Start with the broadest area search – don't narrow by subject or other categories first
Easier, plus can't guess wrong, miss useful, or pick unneeded categories…results-driven
» Knee surgery: cartilage replacement, plastics, …


| 21 Answering hard questions
Main problem is still short searches/under-specification
One solution: relevance feedback – marking good and bad results
A long-standing and proven search refinement technique
More information is better than less
Pseudo-relevance feedback is a research standard
Most commercial forms not widely used…
…but Pubmed is an exception
A catch: must first find a good document to be similar to…may be hard or impossible
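The long-standing technique the slide refers to is usually implemented as Rocchio feedback: move the query vector toward documents the user marked good and away from those marked bad. A minimal sketch with term-weight dictionaries (the terms, weights and default parameters are illustrative):

```python
def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback over term->weight dicts: boost terms from
    good results, demote terms from bad ones, keep only positive weights."""
    terms = (set(query_vec)
             | {t for d in relevant for t in d}
             | {t for d in nonrelevant for t in d})
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        w = alpha * query_vec.get(t, 0.0) + beta * rel - gamma * non
        if w > 0:
            new_q[t] = w
    return new_q

q = {"depression": 1.0}                               # the short, under-specified query
good = [{"depression": 1.0, "serotonin": 0.8}]        # result marked relevant
bad = [{"depression": 0.5, "recession": 0.9}]         # result marked not relevant
expanded = rocchio(q, good, bad)
```

The expanded query picks up "serotonin" and drops "recession", steering the next search toward the psychological sense of depression. Pseudo-relevance feedback is the same computation with the top-ranked results assumed relevant instead of user-marked.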

| 22 One solution: descriptive search
Let the user or situation provide the ideal document – a full problem description – as input in the first place
Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description
Might draw on user or query context
Use thesauri, domain knowledge and limited natural language processing to identify must-haves
Uses lots of data and statistics to find the best matches
» Again, a problem for federated search with limited data access
Should provide the best possible search short of real language understanding
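At its simplest, descriptive search treats the whole problem description as the query vector and ranks candidates by similarity. A bare-bones cosine-similarity sketch (the description and candidate texts are invented; a real system would add the thesauri, domain knowledge and NLP the slide mentions):

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words vector: term -> raw count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The "query" is a full problem description, not two or three keywords
description = ("We need an experimental method for measuring gene "
               "expression in tumor samples using microarray analysis")
candidates = [
    "microarray analysis of gene expression in tumor tissue",
    "a history of federated search systems",
]
ranked = sorted(candidates,
                key=lambda d: cosine(vec(description), vec(d)),
                reverse=True)
```

The long query supplies far more statistical evidence than a two-word search, which is the point of the slide: more input text, better matching, but only if the engine has full access to the data.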

| 23 Summarize, discover & analyze
How do you summarize a corpus?
May want to report on what's present, numbers of occurrences, trends
Ex: What diseases are studied the most? Must know all diseases and look one by one
How do you find a relationship if you don't know what relationships exist?
Ex: Does gene p53 relate to any disease? Must check for each possible relationship
Ad hoc analysis
How do all genes relate to this one disease? Over time?
What organisms has the gene been studied in?
Show me the document evidence

| 24 One solution: text mining
Identify entities (things) in a text corpus
Examples: authors, universities…diseases, drugs, side-effects, genes…companies, lawsuits, plaintiffs, defendants…
Use lexicons, patterns and NLP to find any or all instances of an entity (including new ones)
Identify relationships:
Through co-occurrence
» Relationship presumed from proximity
» Example: author-university affiliation
Through limited natural language processing
» Semantic relations – causes, is-part-of, etc.
» Examples: drug-causes-disease, drug-treats-disease
» Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora ("…it causes…")
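The co-occurrence route is the easy half to sketch: with small lexicons of known entities, presume a relationship whenever a gene and a disease share a sentence, and count the evidence. The lexicons and sentences below are invented for illustration; a real miner would use curated lexicons plus patterns and NLP, and the semantic-relation half needs the verb and anaphora handling listed above.

```python
import re
from collections import Counter

# Small illustrative lexicons; a real system would use curated ones plus NLP
GENES = {"p53", "brca1"}
DISEASES = {"leukemia", "melanoma"}

def cooccurrences(sentences):
    """Presume a gene-disease relationship from proximity: count how often
    each (gene, disease) pair appears in the same sentence."""
    pairs = Counter()
    for s in sentences:
        words = set(re.findall(r"[a-z0-9]+", s.lower()))
        for g in words & GENES:
            for d in words & DISEASES:
                pairs[(g, d)] += 1
    return pairs

corpus = [
    "Mutations in p53 are frequently observed in leukemia patients.",
    "p53 expression was reduced in leukemia cell lines.",
    "BRCA1 screening is standard in melanoma studies.",
]
pairs = cooccurrences(corpus)  # the counts double as document evidence
```

The pair counts answer exactly the slide-23 questions ("does p53 relate to any disease?") without knowing in advance which relationships exist, and each count points back to the sentences that support it.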

| 25 Gene-disease relationships?

| 26 Relationships to p53

| 27 Author teams in HIV research?

| 28 Indirect links from leukemia to Alzheimer's via enzymes

| 29 Long-term: answer any question
Must recognize multiple (any) entities and relationships
Must recognize all forms of linguistic relationship
Must have a background of common-sense information (or enough entities/relations?)
Information on donors (to political parties)
For now, building text miners domain by domain is perhaps the best we can do
Can build on preceding pieces…e.g., if you know drugs, diseases and drug-disease causation, you can try to recognize advancements in drug therapy

| 30 Summary
Federated search addressed the problems of a different time
Had a highly fragmented search space, limitations of individual dbs, technical and interface problems, and a need to just get basic answers
Today's search environment is increasingly centralized and robust
Range of content and demands of users continue to increase
Adequate search is a given…really good search is a challenge best served by new technologies that don't fit into a least-common-denominator framework
Need to locate the best documents (sophisticated ranking, clustering)
Need to answer complex questions
Need to go beyond search for overviews, relationship discovery