Chapter 4 Processing Text

- Modifying/converting documents to index terms
- Why?
  - Convert the many forms of words into more consistent index terms that represent the content of a document
  - Matching the exact string of characters typed by the user is too restrictive (e.g., case sensitivity, punctuation, stemming); it does not work very well in terms of effectiveness
  - Not all words are of equal value in a search
  - It is sometimes not clear where words begin and end; in some languages, e.g., Chinese and Korean, it is not even clear what a word is

Text Statistics
- Huge variety of words used in text, but many statistical characteristics of word occurrences are predictable
  - e.g., distribution of word counts
- Retrieval models and ranking algorithms depend heavily on statistical properties of words
  - e.g., important/significant words occur often in documents but are not high frequency in the collection

Zipf’s Law
- Distribution of word frequencies is very skewed
  - Few words occur very often, many hardly ever occur
  - e.g., “the” and “of”, two common words, make up about 10% of all word occurrences in text documents
- Zipf’s law:
  - The frequency f of a word in a corpus is inversely proportional to its rank r (assuming words are ranked in order of decreasing frequency):
    $f = k / r$, i.e., $f \cdot r = k$
    where k is a constant for the corpus
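One way to check Zipf’s law on a corpus is to rank words by frequency and look at the product of rank and frequency, which the law predicts to be roughly constant. The sketch below is only an illustration: the function name `rank_frequency` and the toy token list are assumptions, not part of the slides.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) tuples.

    Words are ranked by decreasing frequency; ties are broken
    alphabetically so that the output is deterministic.
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    # Toy input; a real check needs a large corpus such as a news collection.
    sample = "the of the a the of".split()
    for row in rank_frequency(sample):
        print(row)
```

On a large corpus, the last column stays in a similar range for mid-ranked words and drifts at the very high and very low ranks.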

Top 50 Words from AP89

Zipf’s Law
- Example: Zipf’s law for AP89, with problems at high and low frequencies
- According to [Ha 02], Zipf’s law
  - does not hold for rank > 5,000
  - is valid when considering single words as well as n-gram phrases, combined in a single curve

[Ha 02] Ha et al. Extension of Zipf's Law to Words and Phrases. In Proc. of Int. Conf. on Computational Linguistics.

Vocabulary Growth
- Another useful prediction related to word occurrence
- As the corpus grows, so does the vocabulary size; however, fewer new words appear when the corpus is already large
- Observed relationship (Heaps’ Law): $v = k \times n^{\beta}$, where
  - v is the vocabulary size (number of unique words)
  - n is the total number of words in the corpus
  - k, β are parameters that vary for each corpus (typical values given are 10 ≤ k ≤ 100 and β ≈ 0.5)
- Predicts that the number of new words increases very rapidly when the corpus is small
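Heaps’ law is easy to evaluate directly. The sketch below uses hypothetical parameters (k = 50, β = 0.5) chosen from inside the typical ranges quoted above; they are not fitted to any real collection, and the function name is an assumption.

```python
def heaps_vocabulary_size(n, k, beta):
    """Predicted vocabulary size v = k * n**beta for a corpus of n words."""
    return k * n ** beta


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    k, beta = 50, 0.5
    for n in (10_000, 1_000_000, 100_000_000):
        print(n, round(heaps_vocabulary_size(n, k, beta)))
```

Each 100-fold increase in corpus size multiplies the predicted vocabulary by only 10 when β = 0.5, which is the slowing growth the slide describes.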

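AP89 Example
[Figure: vocabulary growth for AP89 with the Heaps’ law fit $v = k \times n^{\beta}$; the fitted values of k and β are given in the figure]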

Heaps’ Law Predictions
- The number of new words increases very rapidly when the corpus is small, and continues to increase indefinitely
- Predictions for TREC collections are accurate for large numbers of words, e.g.,
  - First 10,879,522 words of the AP89 collection scanned
  - Prediction is 100,151 unique words
  - Actual number is 100,024
- Predictions for small numbers of words (i.e., < 1000) are much worse

Heaps’ Law on the Web
- Heaps’ Law works with very large corpora
  - New words still occur even after seeing 30 million!
  - Parameter values differ from typical TREC values
- New words come from a variety of sources
  - Spelling errors, invented words (e.g., product and company names), code, other languages, addresses, etc.
- Search engines must deal with these large and growing vocabularies

Heaps’ Law vs. Zipf’s Law
- As stated in [French 02]:
  - The observed vocabulary growth has a positive correlation with Heaps’ law
  - Zipf’s law, on the other hand, is a poor predictor of high-frequency terms, i.e., Zipf’s law is adequate for predicting medium to low frequency terms
  - While Heaps’ law is a valid model for vocabulary growth of web data, Zipf’s law is not strongly correlated with web data

[French 02] J. French. Modeling Web Data. In Proc. of Joint Conf. on Digital Libraries (JCDL).

Estimating Result Set Size
- Word occurrence statistics can be used to estimate the size of the results from a web search
- How many pages (in the results) contain all of the query terms (based on word occurrence statistics)?
- For the query “a b c”:
  $f_{abc} = N \times \frac{f_a}{N} \times \frac{f_b}{N} \times \frac{f_c}{N} = \frac{f_a \times f_b \times f_c}{N^2}$
  - $f_{abc}$: estimated size of the result set using joint probability
  - $f_a$, $f_b$, $f_c$: the number of documents that terms a, b, and c occur in, respectively
  - N is the total number of documents in the collection
  - Assumes that terms occur independently
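The independence estimate generalizes to any number of query terms: multiply N by each term's document fraction. The sketch below is a minimal illustration with made-up document frequencies; the function name and the numbers are assumptions.

```python
def estimate_result_size(doc_freqs, n_docs):
    """Estimate how many documents contain all terms, assuming independence.

    doc_freqs: number of documents each query term occurs in.
    n_docs:    total number of documents in the collection (N).
    """
    estimate = float(n_docs)
    for freq in doc_freqs:
        estimate *= freq / n_docs  # multiply by P(term) = f / N
    return estimate


if __name__ == "__main__":
    # Hypothetical frequencies in a 1,000-document collection.
    print(estimate_result_size([100, 200, 50], 1000))
```

Because real query terms tend to co-occur, this estimate is usually far too low for related terms.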

TREC GOV2 Example
- Collection size (N) is 25,205,179
- Poor estimation due to the independence assumption

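Result Set Size Estimation
- Poor estimates because words are not independent
- Better estimates are possible if co-occurrence information is available
  $P(a \cap b \cap c) = P(a \cap b) \cdot P(c \mid (a \cap b))$
  $f_{\text{tropical} \cap \text{fish} \cap \text{aquarium}} = f_{\text{tropical} \cap \text{aquarium}} \times f_{\text{fish} \cap \text{aquarium}} / f_{\text{aquarium}} = 1921 \times 9722 / 26480 = 705$
  $f_{\text{tropical} \cap \text{fish} \cap \text{breeding}} = f_{\text{tropical} \cap \text{breeding}} \times f_{\text{fish} \cap \text{breeding}} / f_{\text{breeding}} = 5510 \times \ldots / \ldots = \ldots$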
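The co-occurrence estimate for “tropical fish aquarium” can be reproduced in a few lines. In this sketch the two pair counts come from the slide, and the count of documents containing “aquarium” (26,480) is taken from the later GOV2 example in this deck; the function name is an assumption.

```python
def estimate_with_cooccurrence(f_ac, f_bc, f_c):
    """Estimate f(a and b and c) as f(a and c) * f(b and c) / f(c).

    This treats a and b as independent given c, which is weaker than
    assuming all three terms are independent.
    """
    return f_ac * f_bc / f_c


if __name__ == "__main__":
    # tropical+aquarium = 1921, fish+aquarium = 9722, aquarium = 26480
    print(round(estimate_with_cooccurrence(1921, 9722, 26480)))
```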

Result Set Estimation
- Even better estimates using the initial result set (word frequency + current result set)
  - The estimate is simply C/s, where s is the proportion of the total number of documents that have been ranked and C is the number of documents found that contain all the query words
  - Example: “tropical fish aquarium” in GOV2
    - After processing 3,000 out of the 26,480 documents that contain “aquarium”, C = 258
      $f_{\text{tropical} \cap \text{fish} \cap \text{aquarium}} = 258 / (3000 \div 26480) = 2{,}277$
    - After processing 20% of the documents, $f_{\text{tropical} \cap \text{fish} \cap \text{aquarium}} = 1{,}778$ (1,529 is the real value)
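The C/s extrapolation from a partially processed result set is a one-line calculation. This sketch reproduces the slide's first GOV2 figure; the function and parameter names are assumptions.

```python
def estimate_from_sample(matches_so_far, processed, total_candidates):
    """Extrapolate the result set size as C / s.

    matches_so_far:   C, documents seen so far that contain all query words.
    processed:        number of candidate documents processed so far.
    total_candidates: number of documents containing the least frequent term.
    """
    s = processed / total_candidates  # proportion of documents ranked
    return matches_so_far / s


if __name__ == "__main__":
    print(round(estimate_from_sample(258, 3000, 26480)))
```

The estimate improves as more documents are processed, since s approaches 1 and C approaches the true count.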

Estimating Collection Size
- Important issue for Web search engines, in terms of coverage
- Simple method: use the independence model, even though it is not realistic
  - Given two words, a and b, that are independent, and N, the estimated size of the document collection:
    $f_{ab} / N = f_a / N \times f_b / N \;\Rightarrow\; N = (f_a \times f_b) / f_{ab}$
  - Example for GOV2:
    - $f_{\text{lincoln}} = 771{,}326$
    - $f_{\text{tropical}} = 120{,}990$
    - $f_{\text{lincoln} \cap \text{tropical}} = 3{,}018$
    - $N = (120{,}990 \times 771{,}326) / 3{,}018 \approx 30{,}922{,}045$ (actual number is 25,205,179)
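The collection size estimate follows directly from rearranging the independence formula. This sketch plugs in the GOV2 counts quoted on the slide; the function name is an assumption.

```python
def estimate_collection_size(f_a, f_b, f_ab):
    """Estimate N = (f_a * f_b) / f_ab, assuming terms a and b are independent."""
    return f_a * f_b / f_ab


if __name__ == "__main__":
    # lincoln = 771,326; tropical = 120,990; both together = 3,018
    estimate = estimate_collection_size(771_326, 120_990, 3_018)
    print(round(estimate))
```

The result lands near 30.9 million, an overestimate of the actual 25,205,179 documents, which shows how sensitive the method is to the independence assumption.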

Tokenizing
- Forming words from a sequence of characters
- Surprisingly complex in English, and can be harder in other languages
- Early IR systems:
  - Any sequence of alphanumeric characters of length 3 or more
  - Terminated by a space or other special character
  - Upper-case changed to lower-case

Tokenizing
- Example (using the early IR approach):
  - “Bigcorp's 2007 bi-annual report showed profits rose 10%.” becomes
  - “bigcorp 2007 annual report showed profits rose”
- Too simple for search applications or even large-scale experiments
- Why? Too much information is lost
- Small decisions in tokenizing can have a major impact on the effectiveness of some queries
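The early IR approach described above fits in a single expression. The following is a minimal sketch, assuming ASCII letters and digits count as alphanumeric; the function name and the `min_length` parameter are illustrative.

```python
import re


def early_ir_tokenize(text, min_length=3):
    """Tokenize the way early IR systems did.

    A token is a run of alphanumeric characters ended by a space or any
    other special character; short tokens are dropped and the rest are
    lower-cased.
    """
    runs = re.findall(r"[A-Za-z0-9]+", text)
    return [run.lower() for run in runs if len(run) >= min_length]


if __name__ == "__main__":
    sentence = "Bigcorp's 2007 bi-annual report showed profits rose 10%."
    print(" ".join(early_ir_tokenize(sentence)))
```

The possessive, the "bi" prefix, and the "10%" figure all disappear, which is the information loss the slide points out.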

Tokenizing Problems
- Small words can be important in some queries, usually in combinations
  - xp, bi, pm, cm, el paso, kg, ben e king, master p, world war II
- Both hyphenated and non-hyphenated forms of many words are common
  - Sometimes the hyphen is not needed
    - e-bay, wal-mart, active-x, cd-rom, t-shirts
  - At other times, hyphens should be considered either as part of the word or as a word separator
    - winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking

Tokenizing Problems
- Special characters are an important part of tags, URLs, and code in documents
- Capitalized words can have a different meaning from lower-case words
  - Bush, Apple, House, Senior, Time, Key
- Apostrophes can be a part of a word, a part of a possessive, or just a mistake
  - rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

Tokenizing Problems
- Numbers can be important, including decimals
  - Nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat
- Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
  - I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
- Note: tokenizing steps for queries must be identical to steps for documents

Tokenizing Process
- First step is to use a parser to identify the appropriate parts of the document to tokenize
- Defer complex decisions to other components
  - A word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lower-case
  - Everything is indexed
  - Example: 92.3 → 92 3, but search finds documents with 92 and 3 adjacent
  - To enhance the effectiveness of query transformation, incorporate some rules into the tokenizer to reduce dependence on other transformation components

Tokenizing Process
- Not that different from the simple tokenizing process used in the past
- Examples of rules used with TREC
  - Apostrophes in words ignored
    - o’connor → oconnor
    - bob’s → bobs
  - Periods in abbreviations ignored
    - I.B.M. → ibm
    - Ph.D. → ph d
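The two TREC-style rules can be layered on top of the simple tokenizer. This sketch is one possible reading of the slide: it assumes that only abbreviations made of single letters separated by periods are collapsed, which is what makes I.B.M. become "ibm" while Ph.D. becomes "ph d". The function name and regular expressions are illustrative, not the rules of any particular system.

```python
import re


def tokenize(text):
    """Lower-case, apply apostrophe and abbreviation rules, then split."""
    text = text.lower()
    # Rule 1: apostrophes inside words are ignored (straight or curly).
    text = re.sub(r"['’]", "", text)
    # Rule 2: periods in single-letter abbreviations are ignored,
    # e.g. "i.b.m." -> "ibm". "ph.d." does not match and is split later.
    text = re.sub(
        r"\b(?:[a-z]\.){2,}",
        lambda match: match.group(0).replace(".", ""),
        text,
    )
    # Everything else: runs of ASCII letters and digits are tokens.
    return re.findall(r"[a-z0-9]+", text)


if __name__ == "__main__":
    for example in ("O'Connor", "Bob's", "I.B.M.", "Ph.D.", "92.3 the beat"):
        print(example, "->", tokenize(example))
```

The same function has to be applied to queries and documents, otherwise query tokens such as "oconnor" would never match the index.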

Stopping
- Function words (conjunctions, prepositions, articles) have little meaning on their own
- High occurrence frequencies
- Treated as stopwords (i.e., removed)
  - Reduce index space
  - Improve response time
  - Improve effectiveness
- Can be important in combinations
  - e.g., “to be or not to be”

Stopping
- A stopword list can be created from high-frequency words or based on a standard list
- Lists are customized for applications, domains, and even parts of documents
  - e.g., “click” is a good stopword for anchor text
- Best policy is to index all words in documents and make decisions about which words to use at query time
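Both ideas on this slide, deriving a list from high-frequency words and applying it at query time, can be sketched briefly. The function names and the toy data are assumptions made for illustration.

```python
from collections import Counter


def build_stopword_list(tokens, size):
    """Take the `size` most frequent tokens in a collection as stopwords."""
    return {word for word, _ in Counter(tokens).most_common(size)}


def remove_stopwords(query_tokens, stopwords):
    """Drop stopwords from a query; the index itself keeps every word."""
    return [token for token in query_tokens if token not in stopwords]


if __name__ == "__main__":
    collection = "the cat and the dog and the bird".split()
    stopwords = build_stopword_list(collection, 2)
    print(remove_stopwords(collection, stopwords))
    # A query made only of function words loses everything:
    print(remove_stopwords("to be or not to be".split(), {"to", "be", "or", "not"}))
```

The second call returns an empty list, which is why indexing all words and deciding at query time is the safer policy.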

Stemming
- Many morphological variations of words
  - Inflectional (plurals, tenses)
  - Derivational (making verbs into nouns, etc.)
- In most cases, these have the same or very similar meanings
- Stemmers attempt to reduce morphological variations of words to a common stem
  - Usually involves removing suffixes
- Can be done at indexing time or as part of query processing (like stopwords)

Stemming
- Two basic types
  - Dictionary-based: uses lists of related words
  - Algorithmic: uses a program to determine related words
- Algorithmic stemmers
  - Suffix-s: remove ‘s’ endings, assuming plural
    - e.g., cats → cat, lakes → lake, wiis → wii
    - Some false positives: ups → up
    - Many false negatives: supplies → supplie
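The suffix-s stemmer is the simplest algorithmic stemmer and takes only a few lines. This is a minimal sketch; the function name is an assumption.

```python
def suffix_s_stem(word):
    """Remove a trailing 's', assuming it marks a plural."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word


if __name__ == "__main__":
    for word in ("cats", "lakes", "wiis", "ups", "supplies"):
        print(word, "->", suffix_s_stem(word))
```

It reproduces the slide's examples, including the errors: "ups" is conflated with "up", and "supplies" becomes "supplie", which does not match "supply".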

Porter Stemmer
- Algorithmic stemmer used in IR experiments since the 70’s
- Consists of a series of rules designed to remove the longest possible suffix at each step
- Effective in TREC
- Produces stems, not words
- Makes a number of errors and is difficult to modify

Errors of Porter Stemmer
[Table of example errors in two groups: “No Relationship” and “Fail to Find a Relationship”]
- Porter2 stemmer addresses some of these issues
- Approach has been used with other languages

Link Analysis
- Links are a key component of the Web
- Important for navigation, but also for search
  - e.g., a link displayed as “Example website”
  - “Example website” is the anchor text
  - The URL that the link points to is the destination link
  - Both are used by search engines

Anchor Text
- Describes the content of the destination page
  - i.e., the collection of anchor text in all links pointing to a page is used as an additional text field
- Anchor text tends to be short, descriptive, and similar to query text
- Retrieval experiments have shown that anchor text has a significant impact on effectiveness for some types of queries
  - i.e., more than PageRank
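Collecting the anchor text of all links that point to a page can be sketched with Python's standard `html.parser` module. The class name `AnchorCollector` and the example.com URLs are hypothetical; a real crawler would also resolve relative URLs and merge results across pages.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Group anchor text by destination link."""

    def __init__(self):
        super().__init__()
        self.anchor_text = defaultdict(list)  # destination URL -> texts
        self._href = None   # destination of the link being read, if any
        self._parts = []    # text fragments seen inside the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._parts).strip()
            self.anchor_text[self._href].append(text)
            self._href = None


if __name__ == "__main__":
    page = (
        '<p>See <a href="http://example.com/">Example website</a> and '
        '<a href="http://example.com/">this example</a>.</p>'
    )
    collector = AnchorCollector()
    collector.feed(page)
    print(dict(collector.anchor_text))
```

The list stored for each URL can then be indexed as an extra text field for the destination page.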

PageRank
- Billions of web pages, some more informative than others
- Links can be viewed as information about the popularity (authority?) of a web page
  - Can be used by ranking algorithms
- Inlink count could be used as a simple measure
- Link analysis algorithms like PageRank provide more reliable ratings
  - Less susceptible to link spam

Random Surfer Model
- Browse the Web using the following algorithm:
  - Choose a random number r, 0 ≤ r ≤ 1
  - If r < λ, then go to a random page
  - If r ≥ λ, then click a link at random on the current page
  - Start again
- PageRank of a page is the probability that the “random surfer” will be looking at that page
  - Links from popular pages increase the PageRank of the pages they point to
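The random surfer can be simulated directly, and the fraction of time spent on each page approximates its PageRank. In this sketch the three-page graph (A links to B and C, B links to C, C links to A) matches the worked example later in the deck, while λ = 0.15, the step count, and the function name are assumptions chosen for illustration.

```python
import random


def random_surfer(links, lam, steps, seed=0):
    """Simulate the random surfer and return visit frequencies per page.

    links: dict mapping each page to the list of pages it links to.
    lam:   probability of jumping to a random page at each step.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        r = rng.random()
        if r < lam or not links[current]:
            # Random jump; also used when the page has no outgoing links.
            current = rng.choice(pages)
        else:
            current = rng.choice(links[current])
    return {page: count / steps for page, count in visits.items()}


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(random_surfer(graph, lam=0.15, steps=100_000))
```

Simulation converges slowly, so practical systems compute the same probabilities iteratively instead.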

Dangling Links
- Random jump prevents getting stuck on pages that
  - Do not have links
  - Contain only links that no longer point to other pages
  - Have links forming a loop
- Links that point to the second type of pages are called dangling links
  - May also be links to pages that have not yet been crawled

PageRank
- PageRank (PR) of page C = PR(A)/2 + PR(B)/1
- More generally,
  $PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$
  where
  - u is a web page
  - $B_u$ is the set of pages that point to u
  - $L_v$ is the number of outgoing links from page v (not counting duplicate links)

PageRank
- Don’t know PageRank values at the start
- Example: assume equal values of 1/3, then
  - 1st iteration: PR(C) = 0.33/2 + 0.33/1 = 0.5, PR(A) = 0.33/1 = 0.33, PR(B) = 0.33/2 = 0.17
  - 2nd iteration: PR(C) = 0.33/2 + 0.17/1 = 0.33, PR(A) = 0.5/1 = 0.5, PR(B) = 0.33/2 = 0.17
  - 3rd iteration: PR(C) = 0.5/2 + 0.17/1 = 0.42, PR(A) = 0.33/1 = 0.33, PR(B) = 0.5/2 = 0.25
- Converges to PR(C) = 0.4, PR(A) = 0.4, PR(B) = 0.2
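The iterations in this example can be reproduced with a short loop. The link structure below (A links to B and C, B links to C, C links to A) is inferred from the slide's equations, and the function name is an assumption. No random jump is included at this stage.

```python
def pagerank_iteration(links, pr):
    """One synchronous PageRank update without the random jump.

    Each page splits its current PageRank equally among the distinct
    pages it links to.
    """
    new_pr = dict.fromkeys(links, 0.0)
    for page, outlinks in links.items():
        targets = set(outlinks)  # duplicate links are not counted
        for target in targets:
            new_pr[target] += pr[page] / len(targets)
    return new_pr


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    pr = dict.fromkeys(graph, 1 / 3)
    for iteration in range(1, 4):
        pr = pagerank_iteration(graph, pr)
        print(iteration, {page: round(value, 2) for page, value in pr.items()})
```

Repeating the update many more times brings the values close to the converged figures on the slide.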

PageRank
- Taking the random page jump into account, there is a 1/3 chance of going to any page when r < λ
- PR(C) = λ/3 + (1 − λ) × (PR(A)/2 + PR(B)/1)
- More generally,
  $PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$
  where N is the number of pages, λ typically …
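Adding the random jump changes only the starting value of each update. The sketch below assumes every page has at least one outgoing link (so dangling pages are not handled), uses the three-page graph implied by the PR(C) equation, and picks λ = 0.15 purely as an illustrative value; the function name is also an assumption.

```python
def pagerank(links, lam, iterations=100):
    """Iterate PR(u) = lam/N + (1 - lam) * sum(PR(v)/L_v for v linking to u).

    Assumes every page in `links` has at least one outgoing link.
    """
    n = len(links)
    pr = dict.fromkeys(links, 1.0 / n)
    for _ in range(iterations):
        new_pr = dict.fromkeys(links, lam / n)  # share from random jumps
        for page, outlinks in links.items():
            targets = set(outlinks)  # duplicate links are not counted
            for target in targets:
                new_pr[target] += (1 - lam) * pr[page] / len(targets)
        pr = new_pr
    return pr


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    # lam = 0.15 is an assumed value for illustration.
    print(pagerank(graph, lam=0.15))
```

Setting `lam` to 0 recovers the values from the example without the random jump.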

Link Quality
- Link quality is affected by spam and other factors
  - e.g., link farms to increase PageRank
  - Trackback links in blogs can create loops
  - Links from the comments section of popular blogs can be used as the source of link spam
    - Solution: blog services modify comment links to contain the rel="nofollow" attribute
    - e.g., a comment link with the text “Come visit my web page.”
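A blog service could apply the nofollow fix when it renders comment links. This is a minimal sketch using the standard `html.escape` function; the helper name `comment_link` and the example.com URL are hypothetical.

```python
from html import escape


def comment_link(url, text):
    """Render a user-submitted link with rel="nofollow".

    The attribute tells search engines not to count the link when
    computing link-based scores, which removes the incentive for
    comment spam.
    """
    return (
        f'<a href="{escape(url, quote=True)}" rel="nofollow">'
        f"{escape(text)}</a>"
    )


if __name__ == "__main__":
    print(comment_link("http://example.com/", "Come visit my web page."))
```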

Trackback Links

Information Extraction
- Automatically extract structure from text
  - Annotate the document using tags to identify the extracted structure
- Named entity recognition
  - Identify words that refer to something of interest in a particular application
  - e.g., people, companies, locations, dates, product names, prices, etc.

Named Entity Recognition
- Rule-based
  - Uses lexicons (lists of words and phrases) to categorize names
    - e.g., locations, people’s names, organizations, etc.
  - Rules are also used to verify or find new entity names
    - e.g., “… street” for addresses
    - “…, …” or “in …” to verify city names
    - “…, …, …” to find new cities
    - “… …” to find new names
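A rule-based recognizer combines a lexicon lookup with surface patterns. The sketch below implements only an “in …” style rule, assuming the slot is filled by a single capitalized word; the tiny city lexicon, the labels, and the function name are all hypothetical.

```python
import re

# Hypothetical lexicon of known city names (lower-cased).
CITY_LEXICON = {"boston", "amherst"}


def tag_locations(text, lexicon=CITY_LEXICON):
    """Find capitalized words that follow "in" and label them.

    Words found in the lexicon are verified as locations; other
    matches are kept as candidates for new city names.
    """
    entities = []
    for match in re.finditer(r"\bin ([A-Z][a-z]+)", text):
        name = match.group(1)
        if name.lower() in lexicon:
            entities.append((name, "location"))
        else:
            entities.append((name, "candidate-location"))
    return entities


if __name__ == "__main__":
    print(tag_locations("She lives in Boston and works in Springfield."))
```

Candidates that recur across many documents could be reviewed and added to the lexicon, which is how rules help find new entity names.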