信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities
Xin4xi1 jian3suo3 yu3 sou1suo3 yin3qing2
Spring 2018

Last week
We discussed:
Hashing (散列) and search trees (搜索树)
Wildcard queries
Spelling correction
QQ Group: 623881278 Website: PPTs…

Course schedule (日程安排)
Lecture 1: Introduction, Boolean retrieval (布尔检索模型)
Lecture 2: Term vocabulary and posting lists
Lecture 3: Dictionaries and tolerant retrieval
Lecture 4: Index construction and compression
Lecture 5: Scoring, weighting, and the vector space model
Lecture 6: Computing scores, and a complete search system
Lecture 7: Evaluation in information retrieval
Web search engines, advanced topics, and conclusion

PHONETIC （语音的） CORRECTION
Write… Right… Rite… Wright

Phonetic correction Misspellings are often caused by a user typing a query that sounds like the target term. Phonetic hashing: try to group together all terms that sound similar.

Soundex algorithms
Turn every term to be indexed into a 4-character reduced form, e.g. Hermann → H655.
Use these codes to create an inverted index (dictionary 词典). This dictionary is called the “soundex index”.
Do the same with query terms: when a new query arrives, convert its terms to their 4-character codes and search using the soundex index.

How to calculate the 4 character codes?
1. Retain the first letter of the term.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows: B, F, P, V to 1. C, G, J, K, Q, S, X, Z to 2. D, T to 3. L to 4. M, N to 5. R to 6.
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting text.
6. Pad the resulting text with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.
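The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `SOUNDEX_CODES` and `soundex` are hypothetical, non-letter characters are simply ignored, and the first letter is assumed not to take part in the removal of repeated digits.

```python
# Letter-to-digit table taken from the steps above.
# Vowels and H, W, Y are mapped to '0' so that they can be dropped later.
SOUNDEX_CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(term):
    """Return the 4-character code of a term (assumption: A-Z letters only)."""
    letters = [c for c in term.upper() if c in SOUNDEX_CODES]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]          # step 1: keep the first letter
    digits = [SOUNDEX_CODES[c] for c in rest]      # steps 2 and 3
    collapsed = []
    for d in digits:                               # step 4: drop repeated digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    kept = [d for d in collapsed if d != "0"]      # step 5: remove zeros
    return (first + "".join(kept) + "000")[:4]     # step 6: pad and truncate


print(soundex("Hermann"))  # H655, as in the example above
print(soundex("Herman"))   # H655 as well, so both spellings share an index entry
```

Because both spellings map to the same code, a query for one of them can retrieve terms indexed under the other.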

Observation about Soundex
Vowels (元音) are viewed as interchangeable when transcribing names: the letters 'A', 'E', 'I', 'O', 'U' (and also 'H', 'W', 'Y') are all treated in the same way.
Consonants (辅音) with similar sounds are considered to be the same, e.g. D and T.
These rules work for most European languages.

Chapter 4 – Index construction
PDF p.104…

Introduction
We will talk about how to construct an inverted index. This process is called index construction or indexing (索引, suǒyǐn). It is performed by software called an indexer (索引器, suǒyǐnqì).
[Figure: the indexer reads a collection of documents and produces an inverted index used to search for documents]

Introduction
For a Web search engine like Baidu or Bing, the indexer is called a “spider” or “web crawler” (网络爬虫, wǎngluò páchóng). A web crawler is a program that browses the internet periodically to update its index of webpages.

Types of IR systems There are:
- small-scale Information Retrieval systems (e.g. to search documents in a company),
- large-scale Information Retrieval systems (e.g. to search the Web).
In general, we want an IR system to be fast. Thus, the characteristics of the computer hardware (计算机硬件) must be considered.

Computer memory
There are two main types of memory in a computer:
Hard drive (硬盘驱动器): permanent storage, cheaper, slower.
RAM (RAM芯片): temporary storage, expensive, fast.

About hardware 1) Data access time (访问时间)
Accessing the data in RAM is faster than accessing the data in a hard drive. To increase the speed of an IR system we should keep as much data as possible in RAM. We may use a computer having several gigabytes (GB) of RAM for an IR system. A technique called caching (缓存) consists of keeping the most frequently accessed data in RAM memory.

About hardware 2) How the data is organized is important
How the data is organized in memory also influences how fast the data can be read or written. In general, if the data that we read is stored contiguously (连续的) on the hard drive, then reading the data will be faster than if the data is not stored contiguously.
[Figure: data stored contiguously vs. data not stored contiguously]

About hardware 3) Data compression (数据压缩) can reduce the time for reading data on the hard drive
Data compression refers to techniques for reducing the size of the data. If the data is smaller, reading it is faster.
[Figure: uncompressed data vs. compressed data]

Simple approach for index construction
Step 1. Each document from the collection is read. For each word, a <term, document ID> pair is created. For example, the pair <Brutus, 1> indicates that the term “Brutus” appears in document #1, and the pair <Brutus, 2> indicates that it also appears in document #2.

Step 2. All the pairs are sorted alphabetically
Thus, all pairs representing the same term now appear consecutively, e.g. all the pairs for “was”.

Step 3. The pairs with the same term are then combined to create the inverted index (dictionary)
For each term, the index stores:
- the term, e.g. “Brutus”,
- the frequency of this term (optional), e.g. Brutus appears in 2 documents,
- the posting list, e.g. Brutus appears in documents 1 and 2.
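The three steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `build_inverted_index` is hypothetical, terms are obtained by lowercasing and splitting on whitespace, and the two short documents are sample text.

```python
from itertools import groupby


def build_inverted_index(documents):
    """documents: dict mapping a document ID to its text."""
    # Step 1: create one <term, document ID> pair per word.
    pairs = [(term, doc_id)
             for doc_id, text in documents.items()
             for term in text.lower().split()]
    # Step 2: sort the pairs so that identical terms become consecutive.
    pairs.sort()
    # Step 3: combine the pairs of each term into one dictionary entry:
    # (number of documents containing the term, posting list).
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = (len(postings), postings)
    return index


docs = {
    1: "I did enact Julius Caesar I was killed in the Capitol Brutus killed me",
    2: "So let it be with Caesar the noble Brutus hath told you Caesar was ambitious",
}
index = build_inverted_index(docs)
print(index["brutus"])  # (2, [1, 2])
print(index["killed"])  # (1, [1])
```

Everything here happens in RAM, which is why this approach only works when all the pairs fit in memory.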

Example Reuters-RCV1: a collection of about 800,000 news documents published between August 20, 1996 and August 19, 1997. 1 GB of text, average: 200 tokens per document 400,000 terms

Example (cont’d) 100 million tokens
Each <term, document ID> pair requires 8 bytes of memory (32 bits for the term identifier and 32 bits for the document identifier). Storing the 100 million pairs thus takes about 0.8 GB. This can fit in the memory of a desktop computer. However, for larger document collections, it is not possible.

Index construction
If a computer does not have enough RAM, the index must be created on the hard drive. At any given moment, only part of the data can be stored in RAM. Thus, the list of <term, document ID> pairs must be stored on the hard drive. It must also be sorted on the hard drive. It is not easy to write a software (软件) program that does this. This is an advanced topic. For more details, see p.71 of the book.

Several variations of indexing
There are several other approaches for indexing. Another one:
1. An empty dictionary is created in RAM.
2. Documents are read one by one to fill the dictionary.
3. If the memory is full, the current dictionary is saved to disk and a new dictionary is created in memory. The process continues, filling the new dictionary.
4. Finally, all the dictionaries need to be merged to obtain a single dictionary.
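A rough sketch of this approach is given below. It makes several assumptions that are not in the slides: "memory is full" is simulated by a maximum number of terms (`max_terms_in_memory`), the dictionaries "saved to disk" are simply kept in a Python list, and the function names are hypothetical.

```python
def index_in_blocks(documents, max_terms_in_memory):
    """documents: list of (doc_id, text). Returns the list of saved dictionaries."""
    blocks = []    # stands in for the dictionaries saved to disk
    current = {}   # the dictionary currently being filled in RAM
    for doc_id, text in documents:
        for term in text.lower().split():
            if term not in current and len(current) >= max_terms_in_memory:
                blocks.append(current)   # "memory is full": save the dictionary
                current = {}             # and start a new, empty one
            postings = current.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    if current:
        blocks.append(current)
    return blocks


def merge_blocks(blocks):
    """Merge all the partial dictionaries into a single dictionary."""
    merged = {}
    for block in blocks:
        for term, postings in block.items():
            merged.setdefault(term, set()).update(postings)
    return {term: sorted(ids) for term, ids in sorted(merged.items())}


documents = [(1, "new home sales"), (2, "home sales rise"), (3, "new rise")]
blocks = index_in_blocks(documents, max_terms_in_memory=2)
print(len(blocks))           # 3 partial dictionaries
print(merge_blocks(blocks))  # {'home': [1, 2], 'new': [1, 3], 'rise': [2, 3], 'sales': [1, 2]}
```

A real indexer would write each block to a file and merge the files by reading them sequentially, so that the merged dictionary never has to fit in RAM at once.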

Distributed indexing 分布式索引
Up to now, we have discussed indexing on a single computer. For large document collections (e.g. the World Wide Web), indexing cannot be done efficiently using a single computer. Solution: create a distributed index (分布式索引). It is an index that is stored on many computers.

The index is distributed on various computers either according to terms or documents. Here we will discuss indexes where the data is organized according to terms rather than documents.

In practice, distributed indexing is often done in the cloud (云计算) using technologies such as MapReduce.
What is the “cloud”?
- Many computers with standard parts (processor, memory, disk) that work together, up to a thousand computers.
- It can survive the failure of some computers (multiple copies of the data are kept on multiple computers).

We will not talk about the details…

Dynamic indexing (动态索引)
Until now, we have assumed that a document collection is static (it never changes, or rarely changes). But most collections are not static:
- New terms are added to the dictionary.
- Documents are added or removed (posting lists need to be updated).

How to update a dictionary?
Simple approach: rebuild the dictionary periodically from scratch (e.g. every day). This is acceptable if:
- the number of changes over time is small,
- the delay in making new documents searchable is acceptable,
- enough computer resources are available to construct a new index while the old one is still being used.

Dynamic indexing with two indexes
If new documents need to be indexed quickly:
- A main index is created to store documents and their posting lists.
- An auxiliary index is kept in memory to store new documents and their posting lists.

When searching for documents, the search is done on both indexes and the results are merged. Then, the result is shown to the user.
Deletions: a list is used to keep track of documents that have been deleted.
Updates: updated documents are removed from the indexes and inserted again.

When the auxiliary index becomes too large, it is merged with the main index to obtain an updated main index. This can be done periodically.
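The interplay between the two indexes can be sketched as follows. The class `DynamicIndex`, its methods and the size threshold are hypothetical names chosen for this illustration, and the sketch assumes that document IDs are never reused (an updated document is deleted and inserted again under a new ID).

```python
class DynamicIndex:
    def __init__(self, main_index, max_auxiliary_postings=1000):
        self.main = {term: list(p) for term, p in main_index.items()}
        self.auxiliary = {}      # small index for new documents, kept in memory
        self.deleted = set()     # list of deleted document IDs
        self.max_auxiliary_postings = max_auxiliary_postings
        self.auxiliary_size = 0

    def add_document(self, doc_id, text):
        for term in set(text.lower().split()):
            self.auxiliary.setdefault(term, []).append(doc_id)
            self.auxiliary_size += 1
        if self.auxiliary_size > self.max_auxiliary_postings:
            self.merge()         # the auxiliary index has become too large

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Search both indexes, merge the results, filter out deleted documents.
        found = set(self.main.get(term, [])) | set(self.auxiliary.get(term, []))
        return sorted(found - self.deleted)

    def merge(self):
        for term, postings in self.auxiliary.items():
            self.main.setdefault(term, []).extend(postings)
        for term in list(self.main):
            kept = sorted(set(self.main[term]) - self.deleted)
            if kept:
                self.main[term] = kept
            else:
                del self.main[term]
        self.auxiliary, self.deleted, self.auxiliary_size = {}, set(), 0


idx = DynamicIndex({"brutus": [1, 2], "caesar": [1]}, max_auxiliary_postings=3)
idx.add_document(3, "Brutus spoke")
print(idx.search("brutus"))   # [1, 2, 3]
idx.delete_document(2)
print(idx.search("brutus"))   # [1, 3]
```

Here the merge is triggered by a size threshold; a periodic merge, as mentioned above, would call `merge()` on a schedule instead.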

How are indexes stored?
To store a dictionary, a file can be created for each term (e.g. one file each for Shenzhen, Beijing, Brutus, Automobile, …), containing its posting list. However, many computers cannot handle a large number of files well. A better approach: the dictionary is stored in a single file or a database (数据库). Other solutions may also be used.

Performance
Constructing a distributed index is more complicated than constructing an index that is stored on a single computer. But index construction and update can be very fast using a cloud (many computers). In practice, many search engines prefer to reconstruct the index from scratch rather than trying to update it.

A main index is used for searching
The user (用户) searches for documents using the main index while a new index is being constructed. Meanwhile, the indexer builds an updated index.

Construction of positional indexes
We previously discussed positional indexes. Positional index (位置索引): a dictionary where the positions of terms in documents are stored.
Dictionary:
City: Book1 (3, 25, 38), Book20 (4, 100, 1000)
Shenzhen: Book1 (2, 24, 35) …, Book20 (3, 500) …
Located: …
China: …
This indicates that “Shenzhen” appears as the 2nd, 24th and 35th word in “Book1”.

Positional indexes are constructed in the same way as regular indexes. The main difference is that the positions of terms in documents are kept and stored in the index.
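As a small illustration, the sketch below builds a positional index by recording, for each term and each document, the list of word positions (counted from 1). The function name `build_positional_index` and the sample sentence are made up for this example, and terms are obtained by lowercasing and splitting on whitespace.

```python
def build_positional_index(documents):
    """documents: dict mapping a document ID to its text."""
    index = {}
    for doc_id, text in documents.items():
        # enumerate(..., start=1) gives the position of each word in the document.
        for position, term in enumerate(text.lower().split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


docs = {"Book1": "the city shenzhen is located in china near the city hong kong"}
positional_index = build_positional_index(docs)
print(positional_index["city"])      # {'Book1': [2, 10]}
print(positional_index["shenzhen"])  # {'Book1': [3]}
```

Compared with a regular index, the only change is that positions are stored instead of a bare document ID.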

Indexes for ranking
Some IR systems rank documents from the most relevant to the least relevant.

Indexes for ranking
The most relevant results should be shown first to the user. One approach is to sort the index by weight or impact (the highest-weighted documents occur first in the index). This makes it possible to stop a search early, since less important or unpopular documents are listed last.

Security for IR systems
Another important consideration for an IR system is security. For example: employees can search documents in the enterprise database, but some employees should not be able to access top-secret documents. Moreover, even the existence of a document can be sensitive (敏感的文件). Hence, the IR system should not show documents that a user cannot open.

How to ensure security?
A solution: use an access control list (存取控制表). An access control list is a file that indicates the documents that each user can access. It can be viewed as a table (matrix) where rows are users and columns are documents.
[Table: rows are users (User 1, User 2, User 3, …), columns are documents (Doc1, Doc2, Doc3, Doc4); a cell contains 1 if the user can read the document and 0 if the user cannot read it]

How to ensure security?
When a user searches for documents (e.g. user 1):
1. A set of documents that match the user’s query is found using an inverted index (dictionary), e.g. {Doc1, Doc2, Doc3}.
2. Then, the intersection of these documents and the documents that the user can access is calculated.
3. The result is shown to the user, e.g. {Doc1} if Doc1 is the only matching document that user 1 can read.
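This filtering step is a set intersection, as the following sketch shows. The access control list `acl`, the user names and the permissions are invented for the illustration; only the query result {Doc1, Doc2, Doc3} and the final answer {Doc1} come from the example above.

```python
# Hypothetical access control list: for each user, the documents they can read.
acl = {
    "user1": {"Doc1", "Doc4"},
    "user2": {"Doc2", "Doc3"},
    "user3": set(),
}


def secure_search(user, matching_docs, acl):
    """Keep only the matching documents that the user is allowed to read."""
    allowed = acl.get(user, set())          # unknown users can read nothing
    return sorted(set(matching_docs) & allowed)


matches = {"Doc1", "Doc2", "Doc3"}          # documents found with the inverted index
print(secure_search("user1", matches, acl)) # ['Doc1']
```

Documents outside the intersection are never shown, so the user does not even learn that they exist.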

Chapter 5: index compression
pdf p122

Introduction
An index or dictionary can be very large if there are many documents. Compression (压缩): the process of reducing the size of an index. There are several compression techniques. They may reduce the storage space required by up to 75%.

Benefits of compression
1) We can save some disk space. 2) More data can fit in memory. Thus, we can increase the use of “caching (缓存) ” (keeping the most frequently accessed information in RAM memory, for faster access, and reducing the number of disk accesses). 3) Transferring data from disk to memory becomes faster because less data is transmitted (the data is compressed).

Time needed for compression
Using compression requires compressing (压缩数据) and decompressing (解压缩数据) the data. This is not a difficult task. It can be done very quickly by a computer. Thus, the cost of compression and decompression is small compared to the benefits obtained by compression.

Statistical properties of terms in IR
Besides, if we apply preprocessing on a set of documents, the size of the dictionary will be reduced. An example: Reuters-RCV1 collection There are 485,494 terms.

Eliminating the 150 most common words from indexing cuts 25% to 30% of the non positional postings.

English vs other languages
The Oxford English Dictionary: 600,000 words. But this excludes names, numbers, scientific terms, etc. The reduction in dictionary size achieved by preprocessing is greater for some languages, e.g. French. The reason is that French is a morphologically richer language (形态丰富的语言) than English.

Two types of compression
Lossless compression (无损压缩): we reduce the space occupied by the data, but we do not lose any information. We will talk about this!
Lossy compression (有损压缩): we reduce the space, however some data is lost. This can save more space.

Heaps’ law
There is a law for estimating the number of terms in a collection of documents:
$\text{NumberOfTerms} = k \times \text{NumberOfTokens}^{b}$
In general: $30 \le k \le 100$ and $b \approx 0.5$.
NumberOfTokens: the sum of the number of tokens in all documents.

Example: for 1 million tokens, we can expect approximately 38,000 different terms. In Reuters-RCV1, the first million tokens actually contain 38,365 different terms.
The parameter k depends a lot on the nature of the documents and how they are processed. Case folding and stemming reduce the growth rate of the vocabulary. Spelling errors and numbers increase the vocabulary growth.
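The estimate can be reproduced with a few lines of Python. The slides do not state which parameters were used: the values k = 44 and b = 0.49 below are an assumption, taken from the fit for Reuters-RCV1 reported in the textbook listed in the references, and the function name `heaps_law` is hypothetical.

```python
def heaps_law(number_of_tokens, k, b):
    """Estimated number of distinct terms for a collection of the given size."""
    return k * number_of_tokens ** b


# Assumed parameters (not given in the slides): k = 44, b = 0.49.
estimate = heaps_law(1_000_020, k=44, b=0.49)
print(round(estimate))  # roughly 38,000 terms, close to the observed 38,365
```

Changing k within the usual range of 30 to 100 shows how strongly the estimate depends on the kind of documents and on the preprocessing applied.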

The relationship between collection size and vocabulary size is often linear in log-log space.
[Figure: vocabulary size plotted against collection size]

Frequency of terms
In real life, a few terms are accessed very often, while many terms are rarely accessed. We can take advantage of this for dictionary compression.

How to store the dictionary?
Fixed-length encoding: each term is stored using the same amount of memory (e.g. 20 bytes for each term).
Problem 1: if we use a fixed amount of memory for each term, some memory is wasted because not all terms have the same number of characters!
Problem 2: if the chosen size for storing a term is too small, some long terms cannot be stored in the dictionary. With 20 bytes per term, terms with more than 20 characters cannot be stored.

Variable-length encoding: each term is stored using a variable amount of memory
This can save a lot of memory!

Block encoding: each term is preceded by a number indicating the number of letters in the term. This makes it possible to reduce the number of pointers, since only one pointer per block of terms is needed. This can save a lot of memory!
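One possible way to picture block encoding is sketched below. It is an assumption-laden illustration rather than the exact layout of the slides: the names `build_blocked_dictionary` and `read_block` are hypothetical, the length of each term is stored in a single byte (so terms are limited to 255 bytes), and the block size of 4 terms is arbitrary.

```python
def build_blocked_dictionary(sorted_terms, block_size=4):
    """Store all terms in one byte string, keeping one pointer per block."""
    data = bytearray()
    pointers = []
    for i, term in enumerate(sorted_terms):
        encoded = term.encode("utf-8")
        if len(encoded) > 255:
            raise ValueError("term too long for a 1-byte length prefix")
        if i % block_size == 0:
            pointers.append(len(data))   # a pointer only at the start of a block
        data.append(len(encoded))        # the number preceding the term
        data.extend(encoded)
    return bytes(data), pointers


def read_block(data, pointer, block_size=4):
    """Decode the terms of the block that starts at the given pointer."""
    terms, pos = [], pointer
    while pos < len(data) and len(terms) < block_size:
        length = data[pos]
        terms.append(data[pos + 1:pos + 1 + length].decode("utf-8"))
        pos += 1 + length
    return terms


terms = ["auto", "automata", "automate", "automatic", "automation", "car"]
data, pointers = build_blocked_dictionary(terms, block_size=4)
print(pointers)                        # [0, 33]: two pointers instead of six
print(read_block(data, pointers[1]))   # ['automation', 'car']
```

The trade-off is that finding a term inside a block requires scanning the block from its first term.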

Front-coding If a dictionary is sorted, several consecutive words share the same prefix (前缀). This information can be used to further compress the dictionary. In this example, we don’t need to store “automat” several times. This saves memory!

An illustration of the compression
Explanation on next slide

Explanation of the previous slide
We have several words: automata, automate, automatic, automation. We want to compress this data to make it smaller. All these words start with "automat", so we write:
8automat*a <-- Here 8 is the number of letters in the first word "automata", and * marks the end of the shared prefix "automat". The letter "a" after the * completes the first word "automata".
Then, we write "automate" as follows:
1◊e <-- This means that it is the same as "automat" but we must add 1 character, which is "e", to get "automate".
2◊ic <-- This means that it is the same as "automat" but we must add 2 characters, which are "ic", to get "automatic".
3◊ion <-- This means that it is the same as "automat" but we must add 3 characters, which are "ion", to get "automation".

How much reduction?

Compression of posting lists
It is also possible to compress posting lists. Normally, in a dictionary, for each term, we store the full list of documents where it appears. Each document is represented by a number (identifier), which uses a fixed amount of memory. To save memory, we can use a variable amount of memory to store the identifiers of documents. There are many approaches. See p. 95 of the book.
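As one example among these approaches, the sketch below uses variable byte encoding on the gaps between consecutive document identifiers. This particular technique is not named in the slides; it is chosen here only as an illustration, the function names are hypothetical, and posting lists are assumed to be non-empty and sorted.

```python
def gaps(doc_ids):
    """Replace each document ID (except the first) by its distance to the previous one."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def vb_encode_number(n):
    """Encode one number with 7 data bits per byte; the last byte has its high bit set."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128
    return bytes(chunks)


def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte                        # more bytes follow
        else:
            numbers.append(128 * n + byte - 128)      # last byte of this number
            n = 0
    return numbers


postings = [824, 829, 215406]
encoded = vb_encode(gaps(postings))
print(len(encoded))          # 6 bytes instead of 12 with 4 bytes per identifier
print(vb_decode(encoded))    # [824, 5, 214577], the gaps
```

Small gaps, which are common for frequent terms, fit in a single byte, which is where the saving comes from.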

Compression vs Dictionary size
3600 MB for the collection of documents
107 MB for storing the index

Conclusion
Today, we quickly discussed chapters 4 and 5.
We will continue next week… The PPT slides are on the website.

References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press, 2008
