4 Indexing Process Text acquisition Text transformation Index creation identifies and stores documents for indexingText transformationtransforms documents into index terms or featuresIndex creationtakes index terms and creates data structures (indexes) to support fast searching
6 Query Process User interaction Ranking Evaluation supports creation and refinement of query, display of resultsRankinguses query and indexes to generate ranked lists of documentsEvaluationmonitors and measures effectiveness and efficiency (primarily offline)
7 Indexing: Text Acquisition CrawlerIdentifies and acquires documents for search engineMany types – web, enterprise, desktopWeb crawlers follow links to find documentsMust efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)Single site crawlers for site searchTopical or focused crawlers for vertical searchDocument crawlers for enterprise and desktop searchFollow links and scan directories
8 Indexing: Text Acquisition FeedsReal-time streams of documentse.g., web feeds for news, blogs, video, radio, tvRSS (Rich Site Summary) is common standardRSS “reader” can provide new XML documents to search engineConversionConvert variety of documents into a consistent text plus metadata formate.g. HTML, XML, Word, PDF, etc.Convert text encoding for different languagesUsing a Unicode standard like UTF-8
9 Indexing: Text Acquisition Document data storeStores text, metadata, and other related content for documentsMetadata is information about document such as type and creation dateOther content includes links, anchor textProvides fast access to document contents for search engine componentse.g. result list generationWhy should we store documents ?
10 Indexing: Text Transformation ParserProcessing the sequence of text tokens in the document to recognize structural elementse.g., titles, links, headings, etc.Tokenizer recognizes “words” in the textMust consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separatorsMarkup languages such as HTML, XML often used to specify structureTags used to specify document elementsE.g., <h2> Overview </h2>Document parser uses syntax of markup language (or other formatting) to identify structureWhy do we need to recognize the structure of documents ?
11 Indexing: Text Transformation StoppingRemove common wordse.g., “and”, “or”, “the”, “in”Some impact on efficiency and effectivenessCan be a problem for some queriesStemmingGroup words derived from a common steme.g., “computer”, “computers”, “computing”, “compute”Usually effective, but not for all queriesBenefits vary for different languages
12 Indexing: Text Transformation Link AnalysisMakes use of links and anchor text in web pagesLink analysis identifies popularity and community informatione.g., PageRankAnchor text can significantly enhance the representation of pages pointed to by linksSignificant impact on web searchLess importance in other applicationsWhy should we deal with Links ?
13 Indexing: Text Transformation Information ExtractionIdentify classes of index terms that are important for some applicationse.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.ClassifierIdentifies class-related metadata for documentsi.e., assigns labels to documentse.g., topics, reading levels, sentiment, genreUse depends on application
14 Indexing: Index Creation Document StatisticsGathers counts and positions of words and other featuresUsed in ranking algorithmWeightingComputes weights for index termse.g., tf.idf weightCombination of term frequency in document and inverse document frequency in the collection
15 Indexing: Index Creation InversionCore of indexing processConverts document-term information to term-document for indexingDifficult for very large numbers of documentsFormat of inverted file is designed for fast query processingMust also handle updatesCompression used for efficiency
16 Indexing: Index Creation Index distributionDistributes indexes across multiple computers and/or multiple sitesEssential for fast query processing with large numbers of documentsMany variationsDocument distribution, term distribution, replicationP2P and distributed IR involve search across multiple sites
17 Query: User Interaction Query inputProvides interface and parser for query languageMost web queries are very simple, other applications may use formsQuery language used to describe more complex queries and results of query transformatione.g., Boolean queries, Indri and Galago query languagessimilar to SQL language used in database applicationsIR query languages also allow content and structure specifications, but focus on content
18 Query: User Interaction Query transformationImproves initial query, both before and after initial searchIncludes text transformation techniques used for documentsSpell checking and query suggestion provide alternatives to original queryQuery expansion and relevance feedback modify the original query with additional terms
19 User Interaction Results output Constructs the display of ranked documents for a queryGenerates snippets to show how queries match documentsHighlights important words and passagesRetrieves appropriate advertising in many applicationsMay provide clustering and other visualization tools
20 Query: User Interaction Results outputConstructs the display of ranked documents for a queryGenerates snippets to show how queries match documentsHighlights important words and passagesRetrieves appropriate advertising in many applicationsMay provide clustering and other visualization tools
21 Query: Ranking Scoring Calculates scores for documents using a ranking algorithmCore component of search engineBasic form of score is qi diqi and di are query and document term weights for term iMany variations of ranking algorithms and retrieval models
22 Query: Ranking Performance optimization Distribution Designing ranking algorithms for efficient processingTerm-at-a time vs. document-at-a-time processingDistributionProcessing queries in a distributed environmentQuery broker distributes queries and assembles resultsCaching is a form of distributed searching
23 Query: Evaluation Logging Ranking analysis Performance analysis Logging user queries and interaction is crucial for improving search effectiveness and efficiencyQuery logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other componentsRanking analysisMeasuring and tuning ranking effectivenessPerformance analysisMeasuring and tuning system efficiency