INFM 700: Session 9 Search (Part II) Search Engines in Information Architecture Paul Jacobs The iSchool University of Maryland Wednesday, Apr. 18, 2012.

Presentation transcript:

INFM 700: Session 9 Search (Part II) Search Engines in Information Architecture Paul Jacobs The iSchool University of Maryland Wednesday, Apr. 18, 2012 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Today’s Topics
- Very short recap
- Fundamentals of information retrieval
- Search engines in practice (web search and web sites)
- Issues and tricks
  - Stemming/word issues
  - Query formulation/expansion/assistance
  - Tagging/structuring
  - Others
- Deploying search: what we get to do, and how

Vector Space Model
- Assumption: Documents that are “close together” in vector space “talk about” the same things
- [Figure: document vectors d1 to d5 in a space with term axes t1, t2, t3; angles θ and φ marked]
- Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
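A minimal sketch of "closeness" in vector space, assuming closeness is measured as the cosine of the angle between term-weight vectors (a common choice; the slide does not name the measure). The vectors and the names `query`, `docs`, and `cosine_similarity` are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

# Hypothetical weights over three terms (t1, t2, t3).
query = [1.0, 1.0, 0.0]
docs = {
    "d1": [2.0, 2.0, 0.0],  # same direction as the query
    "d2": [0.0, 0.0, 3.0],  # shares no terms with the query
    "d3": [1.0, 0.0, 1.0],  # shares one term
}

# Rank documents from closest to farthest.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)
```

Because cosine ignores vector length, d1 scores as a perfect match even though its weights are twice the query's.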

Term Weighting
- Term weights consist of two components
  - Local: how important is the term in this doc?
  - Global: how important is the term in the collection?
- Here’s the intuition:
  - Terms that appear often in a document should get high weights
  - Terms that appear in many documents should get low weights
- How do we capture this mathematically?
  - Term frequency (local)
  - Inverse document frequency (global)

TF.IDF Term Weighting
  w_{i,j} = tf_{i,j} \cdot \log(N / n_i)
where
- w_{i,j}: weight assigned to term i in document j
- tf_{i,j}: number of occurrences of term i in document j
- N: number of documents in entire collection
- n_i: number of documents with term i
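The four quantities above can be combined in code. This is a sketch only, assuming the common form weight = tf × log(N / n) with a natural logarithm and whitespace tokenization; real engines use many variants. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / n_i)."""
    n_docs = len(docs)  # N: number of documents in the collection
    tokenized = [doc.lower().split() for doc in docs]
    # n_i: number of documents that contain term i (each doc counted once)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf_{i,j}: occurrences of term i in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = ["search engine search", "search interface", "faceted browsing interface"]
w = tf_idf(docs)
for term, weight in sorted(w[0].items()):
    print(term, round(weight, 3))
```

In the first document "search" occurs twice but also appears in two of the three documents, so the rarer "engine" still ends up with the higher weight.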

Summary thus far…
- Represent documents (and queries) as “bags of words” (terms)
- Derive term weights based on frequency
- Use weighted term vectors for each document, query
- Compute a vector-based similarity score
- Display sorted, ranked results

Issues and Tricks
- What’s a word/term?
  - We can ignore words (“stop words”), combine (phrases), split up (“stem”) words
  - Other special treatment (e.g., names, categories)
- Query formulation/suggestion
- Type of information need
- Popularity
  - Based on link analysis/PageRank
  - Based on click through, other
- Structuring and tagging (e.g., “best bets”)

Issues and Tricks (cont’d)
- Thesaurus/query expansion
  - Based on meaning, conceptual relationships
  - Based on decomposition/type
- User feedback/“More like this”
- Clustering/grouping of results
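As a rough illustration of thesaurus-based query expansion, the sketch below adds related terms from a lookup table to each query term. The `THESAURUS` entries and the `expand_query` helper are invented for this example; a deployed engine would draw on a curated or vendor-supplied thesaurus.

```python
# Hypothetical thesaurus: each term maps to related terms to add to the query.
THESAURUS = {
    "car": ["automobile", "auto"],
    "film": ["movie"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus their thesaurus entries, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("car rental"))
```

Expansion tends to raise recall at some cost in precision, which is one reason to test it against real queries before turning it on.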

Morphological Variation
- Handling morphology: related concepts have different forms
  - Inflectional morphology: same part of speech
  - Derivational morphology: different parts of speech
- Different morphological processes:
  - Prefixing
  - Suffixing
  - Infixing
  - Reduplication
- Examples:
  - dogs = dog + PLURAL
  - broke = break + PAST
  - destruction = destroy + ion
  - researcher = research + er

Stemming
- Dealing with morphological variation: index stems instead of words
  - Stem: a word equivalence class that preserves the central concept
- How much to stem?
  - organization → organize → organ?
  - resubmission → resubmit/submission → submit?
  - reconstructionism?
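The "how much to stem?" question can be seen with a deliberately crude suffix stripper. This is a toy sketch, not the Porter stemmer or any production algorithm; the `SUFFIXES` list and `crude_stem` function are assumptions made for illustration.

```python
# Illustrative suffix list, longest first. These are not Porter's rules.
SUFFIXES = ["ization", "ation", "ize", "ion", "s"]

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["dogs", "organization", "organize", "organs", "organ"]:
    print(w, "->", crude_stem(w))
```

Here "organization", "organize", "organs", and "organ" all collapse to the same stem, so a query about church organs would also match documents about organizations. That kind of over-stemming is the risk the slide's question points at.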

Does Stemming Work?
- Generally, yes! (in English)
  - Helps more for longer queries, fewer results
  - Lots of work done in this area
- But used very sparingly in web search. Why?
- References:
  - Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.
  - Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR
  - David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):
  - And others…

Beyond Words…
- Stemming/tokenization = specific instance of a general problem: what is it?
- Other units of indexing
  - Concepts (e.g., from WordNet)
  - Named entities
  - Relations
  - …

Some Observations
- Search engine fundamentals are very similar
- There are many tricks, differences beyond the basic model
- Differences appear differently, and are magnified as we get to sites, specific applications
- So, as we get to deployment …
  - Be skeptical
  - Test rigorously
  - Some small things can make a big difference

Deployment - Overview
- What we can control
- Basic process of setting up/using search in IA
- Key parameters/issues
  - What to search/organizing content
  - Testing and improving results
  - Presentation/interfaces

What we control (the IA part)?
- Requirements and search engine selection
  - Developing search requirements
  - Build vs. buy
  - Vendor evaluation/selection
  - Consultants?
- Content selection
  - What to search/zones/etc.
  - Tags
- Search engine configuration
  - Zones, what gets indexed, sometimes how
  - Number of results, sometimes recall vs. precision
  - Others (very often interface-related)
- Interfaces

Search Engine Selection
- Commercial examples
  - Autonomy (including the former Verity, Ultraseek, ...)
  - Google (site search, search appliance)
  - Thunderstone
- Build your own, open source?
  - Lucene
- Defining requirements
  - Basic search: how big, type of documents, what sort of interface, metadata, parametric?
  - Advanced requirements: automatic tagging, alerts, “more like this”
  - Customization and improvement using logs
  - Keep it focused?

Search Engine Selection (cont’d)
- Pitfalls to avoid
  - “Getting a bargain”
  - Getting it “free”
  - Great sales reps
- Good ideas
  - Get case studies, talk to references
  - Get a “proof of concept period”

Simple Requirements Matrix

Vendor Name:

Requirement/Criterion | Priority | Rating | Comments
1. Identify Early Warnings/Search | | |
1.a. Highly detailed information needs | 1 | |
1.b. Date range restrictions | 1 | |
1.c. Company name restrictions | 1 | |
1.d. Alias/equivalence (e.g., The Walt Disney Company = Disney) | 2 | |
1.e. Ability to assign unique IDs (e.g., Disney = NYSE:DIS) | 2 | |
1.f. Restrict/search by subject area/topic | 2 | |
1.g. Ability to partition/segment articles with multiple topics | 3 | |
1.h. Federated search w/web content, Nexis, etc. | 2 | |
1.i. Use of extended lists (e.g., lists of companies, subjects) | 2 | |
2. Identify Early Warnings/Alerts | | |
2.a. Highly detailed information needs (all of i-h above) | 1 | |
2.b. Controlling/weighting specific elements | 2 | |
2.c. Recall/precision tradeoff | 3 | |
2.d. Identify "new" and "hot" articles that match user's interest | 2 | |
2.e. Sentiment analysis component | 3 | |
3. Identify Early Warnings/Discovery | | |
3.a. Classify documents in pre-defined or user-defined categories | 3 | |
3.b. Document clustering | 3 | |
3.c. Identification of trends/issues | 2 | |
3.d. Other discovery tools | 3 | |
4. Integration and interface requirements | | |

Content Selection (What to Search)
- Generally, search everything but …
  - Be leery about providing “search the web” option
  - Use zones or separate text databases for frequent/infrequent information needs
  - Be careful about outdated/deleted content
  - Make sure “best bets” come to the top
  - Use logs, test & improve

Testing and Improvement
- Keep track of queries (and results, if possible) using logs
  - If logs are not available, try user experiments
  - If results are not available, get them
  - Relevance/correct judgments; quantitative (e.g., recall/precision) scores are, too
- How to improve
  - Focus on most frequent (important?) requests (90-10 or 80-20)
  - “Best bets”
  - Content manipulation (e.g., adding tags)
  - Thesaurus
  - Keep testing
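Two of the measurements mentioned here can be sketched in a few lines: precision and recall for one query, and the small set of frequent queries that accounts for most of a log. The sample judgments, the log contents, and the helper names `precision_recall` and `head_queries` are all hypothetical.

```python
from collections import Counter

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def head_queries(query_log, coverage=0.8):
    """Most frequent queries that together cover at least `coverage` of the log."""
    counts = Counter(q.strip().lower() for q in query_log)
    total = sum(counts.values())
    covered, head = 0, []
    for query, count in counts.most_common():
        if covered >= coverage * total:
            break
        head.append(query)
        covered += count
    return head

# Made-up relevance judgments for one query.
p, r = precision_recall(["p1", "p2", "p3", "p4"], ["p2", "p4", "p7"])
print(p, r)

# Made-up query log: two queries account for 80% of the traffic.
log = ["hours", "hours", "parking", "hours", "jobs",
       "parking", "hours", "map", "hours", "parking"]
print(head_queries(log))
```

Running `head_queries` on a real log gives a short list of requests to judge and tune first, which is the 80-20 idea on the slide.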

“Best Bets”: How to Implement
- Identify desired result page
- Determine possible query strings (from logs)
- Tag meta-data in documents with query string
- Configure search interface (e.g., to show Best Bet first, what to do about multiple Best Bets)
- This is a special case of using tag field (e.g., keywords, categories, description)
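One way to picture the steps above is a lookup table from logged query strings to the desired page, consulted before the regular results are shown. This is a sketch under assumptions: the slide describes tagging document metadata, whereas this example uses a standalone table for brevity, and the `BEST_BETS` entries, URLs, `demo_search`, and `search_with_best_bets` are all made up.

```python
# Hypothetical mapping from normalized query strings (taken from logs)
# to the page an editor wants shown first.
BEST_BETS = {
    "library hours": "/about/hours",
    "hours": "/about/hours",
    "parking": "/visit/parking",
}

def search_with_best_bets(query, organic_search, best_bets=BEST_BETS):
    """Put the best bet (if any) first, then the engine's own results."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    results = list(organic_search(query))
    best = best_bets.get(key)
    if best is None:
        return results
    # Avoid listing the best-bet page twice.
    return [best] + [r for r in results if r != best]

def demo_search(query):
    """Stand-in for a real engine; always returns the same ranked pages."""
    return ["/news/hours-change", "/about/hours", "/contact"]

print(search_with_best_bets("  Library   HOURS ", demo_search))
```

Handling several best bets for one query would mean mapping each query to a list rather than a single page.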

Designing a Search Interface
- The Box (size, position, labels)
- Content selection (defaults, radio buttons or pull-down selection)
- Parameters or advanced search (Booleans, separate zones, other possibilities)

Designing a Search Interface - Results
- Number of results to display
  - Recall/precision tradeoff?
- Snippet/summary information for each hit
- Layout of best bets/other hits
- Repetition of the query
- “No results”: other possible tips
- Iteration and refinement
- Other (e.g., scores, clusters, …)

Some example sites

Integrating Search and Browsing
- Provide more navigation for common needs
  - …based on search logs, other info
- Redirect from search results to navigation
- Faceted browsing...

Faceted Browsing Example
Demo:

Advantages of Facets
- Integrates searching and browsing
- Easy to build complex queries
- Easy to narrow, broaden, shift focus
- Helps users avoid getting lost
- Helps to prevent “categorization wars”
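A small sketch of how facets let a user narrow a result set and see what remains under each value. The catalog in `ITEMS`, the facet names, and the helpers `filter_items` and `facet_counts` are invented for illustration and do not correspond to any particular product.

```python
from collections import Counter

# Hypothetical catalog; each item carries a value for each facet.
ITEMS = [
    {"title": "Trail shoe", "brand": "Acme", "color": "red"},
    {"title": "Road shoe", "brand": "Acme", "color": "blue"},
    {"title": "Sandal", "brand": "Zenith", "color": "red"},
]

def filter_items(items, selections):
    """Keep items that match every selected facet value."""
    return [it for it in items
            if all(it.get(facet) == value for facet, value in selections.items())]

def facet_counts(items, facet):
    """Count the remaining items under each value of one facet."""
    return Counter(it[facet] for it in items if facet in it)

red_items = filter_items(ITEMS, {"color": "red"})
print([it["title"] for it in red_items])
print(dict(facet_counts(red_items, "brand")))
```

Adding or removing an entry in the selections narrows or broadens the set, and the counts show users what is left before they click.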

Recap
- Search is an IA issue!
- Quality of search results/user experience depends on:
  - Understanding how search engines work
  - Choosing and deploying carefully
  - Constant testing and improvement
  - Time
- Tremendous range of parameters/interface choices
- Integrating search and browsing/navigation is a very good idea