Alternatives to Federated Search - Presented by: Marc Krellenstein Date: July 29, 2005.


1 Alternatives to Federated Search - Presented by: Marc Krellenstein Date: July 29, 2005

2 | 2 Why did we ever build federated search?
- No one search service or database had all relevant info or ever could have
- It was too hard to know what databases to search
- Even if you knew which dbs to search, it was too inconvenient to search them all
- Learning one simple interface was easier than learning many complex ones

3 | 3 Do we still need federated search? No

4 | 4 No one service or db has all relevant info?
- Databases have grown bigger than ever imagined
  - Google: 8B documents, Google Scholar: 400M+ ?
  - Scirus: 200M
  - Web of Knowledge (Humanities, Social Sci, Science): 28M
  - Scopus: 27M
  - Pubmed: 14M
- Why?
  - Cheaper and larger hard disks
  - Faster hardware, better software
  - World-wide network availability…no need to duplicate

5 | 5 No one service or db has all relevant info?
- No maximum size in sight
  - A good thing, because content continues to grow
- The simplest technical model for search
  - Databases are logically single and central
  - …but physically multiple and internally distributed
  - Google has ~160,000 servers
- The simplest user model for search
- The catch (but even worse for federated search):
  - Get the data
  - Keep search quality high

6 | 6 It's hard to know what services to search?
- Google/Google Scholar plus 1-2 vertical search tools
  - Pubmed, Compendex, WoK, PsycINFO, Scopus, etc.
- For casual searches: Google alone is usually enough
- Specialized smaller dbs where needed
  - Known to researcher or librarian, or available from list
- Ask a life science researcher what they use
  - "All I need is Google and Pubmed"

7 | 7 It's hard to know what services to search?
- Alerts, RSS, etc. eliminate some searches altogether
- Still…more than one search/source…but must balance inconvenience against costs of federated search:
  - Will still need to do multiple searches…federated not enough
  - Least common denominator search – few advanced features
    » Users are increasingly sophisticated
  - Duplicates
  - Slower response time
  - Broken connectors
  - The feeling that you're missing stuff…

8 | 8 One interface is easier to learn than many?
- Yes…studies suggest users like a common interface (if not a common search service)
- BUT
  - Google has demonstrated the benefits of simplicity
  - More products are adopting simple, similar interfaces
  - There is still too much proprietary syntax – though advanced features and innovation justify some of it

9 | 9 So what are today's search challenges?
- Getting the data for centralized and large vertical search services
- Keeping search quality high for these large databases
- Answering hard search questions

10 | 10 Getting the data for centralized services
- Crawl it if it's free
- …or make or buy it
  - Expensive, but usually worth the cost
  - Should still be cheaper for customers than many services
- …or index multiple, maybe geographically separate databases with a single search engine that supports distributed search

11 | 11 Distributed (local/remote) search
- Use common metadata scheme (e.g., Dublin Core)
- Search engine provides parallel search, integrated ranking/results
  - Google, Fast and Lucene already work this way even for single database
- The separate databases can be maintained/updated separately
- Results are truly integrated…as if it's one search engine
  - One query syntax, advanced capabilities, no duplicates, fast
- Still requires common technology platform
- Federated search standards may someday approximate this
  - Standard syntax, results metadata…ranking? Amazon's A9?
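The "parallel search, integrated ranking" idea above can be sketched roughly as follows. This is only an illustration: the shard contents, the `search_shard` helper and the toy term-count score are all hypothetical, and the merge is only valid because every shard applies the same scoring function (the "common technology platform" the slide mentions).

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical shards: each maps doc id -> text and is maintained separately.
SHARDS = {
    "shard_a": {"a1": "federated search connectors", "a2": "gene expression search"},
    "shard_b": {"b1": "search ranking search quality", "b2": "library catalog"},
}

def search_shard(name, term):
    """Return (score, doc_id) hits for one shard, best first."""
    hits = []
    for doc_id, text in SHARDS[name].items():
        score = text.split().count(term)  # toy score: raw term frequency
        if score:
            hits.append((score, doc_id))
    return sorted(hits, reverse=True)

def distributed_search(term, k=3):
    """Query all shards in parallel and merge the hits into one ranked list."""
    with ThreadPoolExecutor() as pool:
        per_shard = list(pool.map(lambda name: search_shard(name, term), SHARDS))
    # Each per-shard list is already sorted best first, so a k-way merge
    # yields a single globally ordered result list.
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _, doc_id in list(merged)[:k]]
```

Calling `distributed_search("search")` returns one list drawn from both shards, ordered by score rather than grouped by source. A federated setup typically cannot do this, because each remote service scores with its own undisclosed formula.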

12 | 12 Keeping search quality high in big dbs
- Can interpret keyword, Boolean and pseudo-natural language queries
- Spell checking, thesauri and stemming to improve recall (and sometimes precision)
- Get lots of hits in a big db, but that's usually OK if there are good ones on top

13 | 13 Keeping search quality high in big dbs
- Current best practice relevancy ranking is pretty good:
  - Term frequency (TF): more hits count more
  - Inverse document frequency (IDF): hits of rarer search terms count more
  - Hits of search terms near each other count more
  - Hits on metadata count more
    » Use anchor text – referring text – as metadata
  - Items with more links/references to them count more
    » Authoritative links/referrers count yet more
  - Many other factors: length, date, etc.
- Sophisticated ranking is a weak point for federated search
- Google's genius: emphasize popularity to eliminate junk from the first pages (even if you don't always serve the best)
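The first two ranking factors listed above, TF and IDF, can be illustrated with a small sketch. It assumes whitespace tokenization and the plain `log(N / df)` form of IDF; the `DOCS` sample and the function name are made up for illustration, and production engines add the proximity, metadata and link signals from the rest of the list.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
DOCS = {
    "d1": "depression treatment depression symptoms",
    "d2": "economic depression history",
    "d3": "treatment of knee injuries",
}

def tf_idf_rank(query, docs):
    """Rank doc ids by the sum of TF * IDF over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                idf = math.log(n_docs / df[term])  # rarer terms weigh more
                score += tf[term] * idf
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

For the query `"depression treatment"`, `d1` ranks first because it contains both terms and repeats one of them. Note that computing IDF requires collection-wide statistics, which is one reason sophisticated ranking is hard when each source exposes only a handful of results.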

14 | 14 But search challenges remain
- Finding the best (not just good) documents
  - Popularity may not turn up the best, most recent, etc.
- Answering hard questions
  - Hard to match multiple criteria
    » find an experimental method like this one
  - Hard to get answers to complex questions
    » What precursors were common to World War I and World War II?
- Summarize, uncover relationships, analyze
- Long-term: understand any question…
- None of the above helped by least common denominator federated search

15 | 15 Finding the best
- Don't rely too much on popularity
- Even then, relevancy ranking has its limits
  - "I need information on depression"
  - "Ok…here are 2,352 articles and 87 books"
- Need a dialog…"what kind of depression?" …"psychological"…"what about it?"
- Underlying problem: most searches are under-specified

16 | 16 One solution: clustering documents
- Group results around common themes: same author, web site, journal, subject…
- Blurt out largest/most interesting categories: the inarticulate librarian model
  - Depression: psychology, economics, meteorology, antiques…
  - Psychology: treatment of depression, depression symptoms, seasonal affective…
  - Psychology: Kocsis, J. (10), Berg, R. (8), …
- Themes could come from static metadata or dynamically by analysis of results text
  - Static: fixed, clear categories and assignments
  - Dynamic: doesn't require metadata/taxonomy
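For the static-metadata case, grouping results and reporting the largest categories with counts (as in "Kocsis, J. (10)") amounts to counting field values over the result set. A minimal sketch, assuming each result is a dict with hypothetical `subject` and `author` fields:

```python
from collections import Counter

# Hypothetical result records; the field names are illustrative only.
RESULTS = [
    {"subject": "psychology", "author": "Kocsis, J."},
    {"subject": "psychology", "author": "Berg, R."},
    {"subject": "psychology", "author": "Kocsis, J."},
    {"subject": "economics", "author": "Berg, R."},
]

def facet_counts(results, field, top=3):
    """Count results per value of one metadata field, largest group first."""
    counts = Counter(r[field] for r in results if field in r)
    return counts.most_common(top)

def drill_down(results, field, value):
    """Keep only the results in the chosen cluster."""
    return [r for r in results if r.get(field) == value]
```

Here `facet_counts(RESULTS, "subject")` gives the top-level themes, and `facet_counts(drill_down(RESULTS, "subject", "psychology"), "author")` gives the authors within one theme. Dynamic clustering would instead derive the groups from the result text and is not shown.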

17 | 17 Clustering benefits
- Disambiguates and refines search results to get to documents of interest quickly
- Can navigate long result lists hierarchically
  - Would never offer thousands of choices to choose from as input…
  - Access to bottom of list…maybe just less common
  - Won't work with federated search that retrieves limited results from each source
- Discovery – new aspects or sources
- Can narrow results *after* search
  - Start with the broadest area search – don't narrow by subject or other categories first
  - Easier, plus can't guess wrong, miss useful, or pick unneeded, categories…results-driven
    » Knee surgery: cartilage replacement, plastics, …


21 | 21 Answering hard questions
- Main problem is still short searches/under-specification
- One solution: Relevance feedback – marking good and bad results
  - A long-standing and proven search refinement technique
  - More information is better than less
  - Pseudo-relevancy feedback is a research standard
  - Most commercial forms not widely used…
  - …but Pubmed is an exception
- A catch: Must first find a good document to be similar to…may be hard or impossible
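One classic way to turn "good" and "bad" marks into a refined query is Rocchio-style reweighting. The slide does not name a specific algorithm, so treat this as one possible illustration rather than the method the speaker had in mind. The function name and the `alpha`, `beta`, `gamma` values are assumptions, and raw term counts stand in for properly normalized vectors.

```python
from collections import Counter

def rocchio(query_terms, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return (term, weight) pairs for a query refined by user feedback."""
    # Original query terms start with weight alpha.
    weights = Counter({term: alpha for term in query_terms})
    for docs, sign, coeff in ((good_docs, 1, beta), (bad_docs, -1, gamma)):
        for doc in docs:
            for term, count in Counter(doc.lower().split()).items():
                # Move toward terms in good docs, away from terms in bad docs,
                # averaging over the number of marked documents.
                weights[term] += sign * coeff * count / len(docs)
    # Keep only positively weighted terms, strongest first.
    positive = ((t, w) for t, w in weights.items() if w > 0)
    return sorted(positive, key=lambda tw: tw[1], reverse=True)
```

Starting from the one-word query `["depression"]`, marking "seasonal depression treatment" as good and "economic depression" as bad adds "seasonal" and "treatment" to the query and drops "economic". This is the "more information is better than less" effect, and it also shows the catch: nothing improves until at least one good document has been found.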

22 | 22 One solution: descriptive search
- Let the user or situation provide the ideal document – a full problem description – as input in the first place
  - Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description
  - Might draw on user or query context
- Use thesauri, domain knowledge and limited natural language processing to identify must-haves
- Uses lots of data and statistics to find best matches
  » Again, a problem for federated search with limited data access
- Should provide the best possible search short of real language understanding

23 | 23 Summarize, discover & analyze
- How do you summarize a corpus?
  - May want to report on what's present, numbers of occurrences, trends
  - Ex: What diseases are studied the most?
  - Must know all diseases and look one by one
- How do you find a relationship if you don't know what relationships exist?
  - Ex: Does gene p53 relate to any disease?
  - Must check for each possible relationship
- Ad hoc analysis
  - How do all genes relate to this one disease? Over time?
  - What organisms has the gene been studied in?
  - Show me the document evidence

24 | 24 One solution: text mining
- Identify entities (things) in a text corpus
  - Examples: authors, universities… diseases, drugs, side-effects, genes…companies, law suits, plaintiffs, defendants…
  - Use lexicons, patterns, NLP for finding any or all instances of the entity (including new ones)
- Identify relationships:
  - Through co-occurrence
    » Relationship presumed from proximity
    » Example: author-university affiliation
  - Through limited natural language processing
    » Semantic relations – causes, is-part-of, etc.
    » Examples: drug-causes-disease, drug-treats-disease
    » Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora (…it causes…)
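The lexicon-plus-co-occurrence approach can be sketched in a few lines. The tiny `GENES` and `DISEASES` lexicons, the sentence splitter and the "same sentence" proximity rule are all simplifying assumptions; a real text miner would use much larger lexicons and patterns, and would need NLP to distinguish relation types such as "causes" versus "treats".

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-lexicons; real ones would hold thousands of names.
GENES = {"p53", "brca1"}
DISEASES = {"leukemia", "breast cancer"}

def find_entities(sentence, lexicon):
    """Return the lexicon entries that occur as whole words in a sentence."""
    text = sentence.lower()
    return {e for e in lexicon if re.search(rf"\b{re.escape(e)}\b", text)}

def cooccurrences(corpus):
    """Count gene-disease pairs that are mentioned in the same sentence."""
    pairs = Counter()
    for doc in corpus:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            genes = find_entities(sentence, GENES)
            diseases = find_entities(sentence, DISEASES)
            # A relationship is presumed purely from proximity.
            pairs.update(product(sorted(genes), sorted(diseases)))
    return pairs
```

Running `cooccurrences` over a corpus and calling `.most_common()` on the result answers questions like "which diseases co-occur with p53?" without checking each candidate pair by hand. The sentences contributing to each count can be kept alongside as the document evidence.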

25 | 25 Gene-disease relationships?

26 | 26 Relationships to p53

27 | 27 Author teams in HIV research?

28 | 28 Indirect links from leukemia to Alzheimer's via enzymes

29 | 29 Long-term: answer any question
- Must recognize multiple (any) entities and relationships
- Must recognize all forms of linguistic relationship
- Must have background of common sense information (or enough entities/relations?)
  - Information on donors (to political parties)
- For now, building text miners, domain by domain, is perhaps the best we can do
- Can build on preceding pieces…e.g., if you know drugs, diseases and drug-disease causation, can try to recognize advancements in drug therapy

30 | 30 Summary
- Federated search addressed problems of a different time
  - Had a highly fragmented search space, limitations of individual dbs, technical and interface problems and need to just get basic answers
- Today's search environment is increasingly centralized and robust
- Range of content and demands of users continue to increase
- Adequate search is a given…really good search is a challenge best served by new technologies that don't fit into a least-common-denominator framework
  - Need to locate best documents (sophisticated ranking, clustering)
  - Need to answer complex questions
  - Need to go beyond search for overviews, relationship discovery

