信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

Philippe Fournier-Viger
Full professor, School of Natural Sciences and Humanities
Xin4xi1 jian3suo3 yu3 sou1 suo3 yin3qing2
Spring 2018

Introduction
Philippe Fournier-Viger
Full professor (教授) at HIT since 2015
Previously: professor at University of Moncton (Canada / 加拿大)
Ph.D. in Computer Science (Canada)
My specialty is data mining (数据挖掘 / 大数据)
I am fluent in English and French, and can speak some basic Chinese.

How to contact me?
We can discuss immediately after lectures.
E-mail (电子邮件): I check my e-mails once a day
Office: G421
Office phone:
Wechat (微信): philfv

Teaching assistants
There will be 2 teaching assistants for this course. They will:
grade homework,
check attendance,
help with other issues related to the course.
QQ group:

About this course
16 hours (8 lectures)
1 credit, general course (本科专业课程, 通识课程)
No prerequisites
Evaluation:
Attendance (上课): 10 %
Course work (两个作业, two assignments): 30 %
Final exam (考试): 60 %

Main objective (目的)
Goal: understand how information retrieval systems work. For example:
Web search engines (搜索引擎) such as Baidu and Bing,
Text document retrieval systems (文本检索系统).

Specific goals
What is an information retrieval system (信息检索系统)?
Why are they used? How do they work?
An introduction to information retrieval techniques: Boolean retrieval, dictionary-based retrieval, index construction and compression, scoring, the vector space model, …

Specific goals (cont’d)
computing scores in a search system, query processing, language models, text and vector space classification, clustering, and web search, including crawling, indexes, and link analysis

Course schedule (日程安排)
Lecture 1 (today - 今天): Introduction, Boolean retrieval
Lecture 2: Term vocabulary and posting lists
Lecture 3: Dictionaries and tolerant retrieval
Lecture 4: Index construction and compression
Lecture 5: Scoring, weighting, and the vector space model
Lecture 6: Computing scores, and a complete search system
Lecture 7: Evaluation in information retrieval
Lecture 8: Web search engines, advanced topics, and conclusion

Lecture slides (PPTs)
I will provide detailed lecture slides (PPTs).
I use this book to prepare the course: Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press, 2008
Note: It is not necessary to buy the book! (没有必要买这本书)

Course website Slides, assignments, grades, and the updated calendar will be provided on the course website:

Rules
Students must attend all lectures.
Students must arrive on time.
Homework must be submitted before the deadline. No late homework will be accepted.

Rules (cont’d)
Plagiarism (抄袭) will not be tolerated.
Phones need to be turned off during lectures.
Students should not talk or eat in the classroom, or do anything that may disturb other classmates.

Lectures
Each lecture starts at 18:30 and ends at 20:15.
We will take a 5-minute break around 19:20.

Chapter 1 – INTRODUCTION

Introduction
What is Information Retrieval (IR, 信息检索)?
Information (信息): any kind of information or data.
Retrieve (verb, 检索): to search for, obtain, or look up.

Introduction
Reading your student number on your student card (学生证): is it information retrieval?
We will use a more restrictive definition of information retrieval.

Introduction
We will consider information retrieval as searching for documents.
[Diagram: a user (用户) sends a query (查询), e.g. "olympics", to a database (数据库) of documents and receives the relevant documents.]

Information retrieval: finding documents of an unstructured nature (e.g. text) that satisfy some requirements.

We assume that there is a large collection of documents (generally, stored on computers).

In the past: information retrieval was mostly done by librarians (图书馆员) and other professional searchers.

Today: billions of people perform information retrieval using computers and cellphones…
Searching for webpages (网页), searching e-mails…

Today
More and more data is collected and stored in computers (this is the big data era - 大数据时代). Why?
Storing data is cheaper.
Networks are fast.
As a result, information retrieval is becoming the most important way of accessing information.

Types of documents
Structured data (结构化数据): documents that have a clear and well-defined structure (e.g. data about bank customers).
Searching structured data is easy (容易).
Name = P. Black, Country = Canada, Age = 20
Name = X. Zhang, Country = China, Age = 25

Types of documents
Unstructured data (非结构化数据): documents that have no clear structure, or are not organized in a predefined manner (e.g. text documents).
Searching unstructured data is hard (难).
Information retrieval is mostly about searching unstructured data.

Information retrieval is also…
To let users browse documents (浏览文件)…
To let users filter documents (过滤文件)…
To automatically group together books on similar topics (clustering - 聚类)
To classify books into some categories (classification - 分类)
To rank documents by relevance (相关排序)

Information retrieval at a large scale (大规模)
Web search (very large scale)
Billions of documents on millions of computers
Important to search quickly
A challenge: some websites try to cheat (欺骗) to appear more often in search engine results

Information retrieval at a small scale (小规模)
Personal search
Search files using an operating system (操作系统) such as Windows or Android
Needs to be fast
Needs to handle many different types of files
Search e-mails. E-mail service providers usually classify e-mails into categories such as spam (垃圾邮件) and relevant e-mails.
Needs to be reliable

Information retrieval at a medium scale
Enterprise (企业) search
Search documents of an enterprise, for example:
a database of patents (专利),
a database of research papers.
Documents are typically stored in a centralized database.
Computers are used to search the databases.

An example (1.1)
We have a set of books. We want to find all books containing: Beijing AND Shenzhen AND NOT Guangzhou
How to search?

How to search?
Naive approach (幼稚的方法):
Read all books word by word to find the books containing Beijing and Shenzhen and not Guangzhou.
For a human: this is very slow!
For a computer: it is fast if there are not too many books.
Web: billions of documents…
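The naive approach can be sketched in a few lines of Python. This is only an illustration: the three tiny books and the function name `naive_search` are hypothetical. The point is that every query rereads every word of every book.

```python
# Hypothetical toy collection: book title -> full text.
books = {
    "Book 1": "Beijing and Shenzhen are large cities",
    "Book 2": "Guangzhou is near Shenzhen and far from Beijing",
    "Book 3": "Beijing is the capital",
}

def naive_search(books, must_have, must_not_have):
    """Scan every book word by word, without any index."""
    results = []
    for title, text in books.items():
        words = set(text.lower().split())
        has_all = all(w.lower() in words for w in must_have)
        has_none = all(w.lower() not in words for w in must_not_have)
        if has_all and has_none:
            results.append(title)
    return results

# Beijing AND Shenzhen AND NOT Guangzhou
print(naive_search(books, ["Beijing", "Shenzhen"], ["Guangzhou"]))  # ['Book 1']
```

The running time grows with the total number of words in the collection, which is why this does not scale to billions of documents.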

We want to:
Search quickly,
Search using flexible matching operations (灵活的匹配), e.g. search for all documents with Shenzhen « within 5 words » of Beijing,
Rank the results by relevance (相关性排序): we want to find the best answer(s) to queries made by users.
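As an illustration of a flexible matching operation, here is a minimal sketch of a « within k words » test on a single text. The function name `within_k_words` is hypothetical, and splitting on spaces is a simplification (punctuation is not handled).

```python
def within_k_words(text, term_a, term_b, k=5):
    """Return True if term_a and term_b occur at most k word positions apart."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= k for i in positions_a for j in positions_b)

print(within_k_words("Shenzhen is south of Beijing", "Shenzhen", "Beijing"))  # True
print(within_k_words(
    "Shenzhen hosts many companies while the capital is Beijing",
    "Shenzhen", "Beijing"))  # False (8 positions apart)
```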

How to search quickly?
We can create an index (索引) of the documents.
The index is created in advance.
We search for documents using the index rather than reading the documents word by word.
This is similar to how the table of contents (目录) of a book helps us to find information in the book.

An example of index (索引)
A term-document incidence matrix (关联矩阵) can be used as an index for searching documents.
[Table: rows Book 1 to Book 5, columns Beijing, Shenzhen, …, Guangzhou; the cell values are not preserved in this transcript.]
1 = the term appears in the document
0 = it does not appear

Example 1: Search books with Beijing AND Shenzhen
The result of this query is: {Book 1, Book 5}

Example 2: Search books with Beijing AND NOT Shenzhen

Example 3: Search books with Beijing OR Shenzhen

A term-document incidence matrix is said to be a Boolean matrix (布尔矩阵): a table containing only 0s and 1s.
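A minimal sketch of how such a Boolean matrix answers queries. The cell values of the matrix are not available here, so the 0/1 values below are hypothetical; they were chosen so that Beijing AND Shenzhen returns Book 1 and Book 5, as in Example 1. The helper names (`row_and`, `row_or`, `row_not`, `titles`) are also made up for this illustration.

```python
# Hypothetical incidence matrix: one row per term, one column per book.
# 1 = the term appears in the book, 0 = it does not.
books = ["Book 1", "Book 2", "Book 3", "Book 4", "Book 5"]
matrix = {
    "Beijing":   [1, 1, 0, 0, 1],
    "Shenzhen":  [1, 0, 1, 0, 1],
    "Guangzhou": [0, 1, 0, 1, 1],
}

def row_and(a, b):
    return [x & y for x, y in zip(a, b)]

def row_or(a, b):
    return [x | y for x, y in zip(a, b)]

def row_not(a):
    return [1 - x for x in a]

def titles(row):
    """Translate a 0/1 row back into book titles."""
    return [book for book, bit in zip(books, row) if bit]

beijing, shenzhen, guangzhou = (matrix[t] for t in ("Beijing", "Shenzhen", "Guangzhou"))

print(titles(row_and(beijing, shenzhen)))           # ['Book 1', 'Book 5']
print(titles(row_and(beijing, row_not(shenzhen))))  # ['Book 2']
print(titles(row_or(beijing, shenzhen)))            # ['Book 1', 'Book 2', 'Book 3', 'Book 5']
print(titles(row_and(row_and(beijing, shenzhen), row_not(guangzhou))))  # ['Book 1']
```

Each query is answered by combining rows cell by cell, without reading the books again.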

Some vocabulary

Boolean retrieval model (布尔检索模型)
We can search for documents using Boolean expressions (布尔表达式).
A Boolean expression is one or more terms combined with the operators AND, OR and NOT.
Example: Beijing AND Shenzhen AND NOT Guangzhou

Collection
A collection (收集) is a set of documents that we may want to search.
Example:
N = 1 million documents
Each document is about 1000 words (2-3 pages)
M = 500,000 distinct terms in these documents
Size: 6 GB
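The 6 GB figure can be checked with a quick calculation. The average of 6 bytes per word (including spaces and punctuation) is an assumption added here to make the numbers match; it is not stated in the example above.

```python
n_documents = 1_000_000       # N = 1 million documents
words_per_document = 1_000    # about 1000 words each
bytes_per_word = 6            # assumption: average word length incl. spaces/punctuation

total_bytes = n_documents * words_per_document * bytes_per_word
total_gb = total_bytes / 10**9

print(total_gb)  # 6.0
```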

Information need, query
Information need (信息需求): the topic about which the user wants to know more (e.g. I want to know about the Beijing Olympics).
Query (查询): what the user enters in a computer to search for documents (e.g. « Beijing AND Olympics AND 2008 »).

Relevance (关联)
A document is relevant if it contains information that is valuable given the information need of a person.
e.g. a user wants to find out about outdoor activities in Shenzhen.
Challenge: relevant documents may or may not contain these words.

Information retrieval system (信息检索系统)
An IR system is a piece of software or a website that can be used to search for documents. e.g.

How to evaluate an IR system?
A good IR system is an IR system that can provide relevant documents to users.
But how can we measure the relevance of documents?
Two evaluation measures are used: precision and recall.

Precision (准确率)
Precision: What fraction of the returned documents are relevant to the information need?
Example: A person searches for webpages about Beijing.
The search engine returns: 5 relevant webpages and 5 irrelevant webpages.
Precision = 5 / 10 = 0.5 (50 %)

Recall (召回)
Recall: What fraction of the relevant documents in a collection were returned by the system?
Example: A database contains 1000 documents about HITSZ.
The user searches for documents about HITSZ.
Only 100 documents about HITSZ are retrieved.
Recall = 100 / 1000 = 0.1 (10 %)
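The two measures can be written as short Python functions. This is a sketch using the numbers from the two examples; the function and parameter names are chosen here for illustration.

```python
def precision(relevant_returned, total_returned):
    """Fraction of the returned documents that are relevant."""
    return relevant_returned / total_returned

def recall(relevant_returned, total_relevant):
    """Fraction of the relevant documents in the collection that were returned."""
    return relevant_returned / total_relevant

# Beijing example: 5 relevant + 5 irrelevant webpages returned.
print(precision(5, 5 + 5))  # 0.5

# HITSZ example: 100 of the 1000 relevant documents retrieved.
print(recall(100, 1000))    # 0.1
```

Both functions assume a non-empty denominator: at least one returned document for precision, at least one relevant document for recall.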

Dictionary
A dictionary is a set of terms.
Each term is associated with a list of the documents where it appears.
Each entry of a list is called a posting.
Example (term: postings):
City: Book1, Book2, Book 20, Book 7…
Shenzhen: Book1, Book3 …
Located: …
China: Book1, Book 20, …

Dictionary
A dictionary is usually much smaller than a matrix because not all words appear in all documents.
[Table: a matrix with one row per book (Book 1 to Book 1000) and one column per word (City, Shenzhen, …, word 500,000).]
1000 books × 500,000 words = a huge matrix of 500,000,000 entries!
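A rough comparison of the two sizes. The matrix size comes from the numbers above; the figure of about 1,000 distinct terms per book is a hypothetical assumption, used only to show the order of magnitude.

```python
n_books = 1_000
n_terms = 500_000

# The matrix stores one cell for every (book, term) pair, mostly zeros.
matrix_entries = n_books * n_terms
print(matrix_entries)  # 500000000

# A dictionary with postings stores one entry only where a term actually appears.
distinct_terms_per_book = 1_000  # assumption for illustration
max_postings = n_books * distinct_terms_per_book
print(max_postings)  # 1000000

print(matrix_entries // max_postings)  # 500
```

Under this assumption the postings hold about 500 times fewer entries than the matrix.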

How to create an index?
Step 1: collect the documents to be indexed
Book1, Book2, Book3, …, Book100

Step 2: tokenize the text (标记文本): separate it into words
Book1: « The city of Shenzhen is located in China… »
token1 token2 … … token7 token8

Step 3: linguistic preprocessing (语言的预处理)
Keep only the terms that are useful for indexing documents.
« The city of Shenzhen is located in China… »
During that step, words can also be transformed if necessary:
friends → friend
wolves → wolf
eaten → eat
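Steps 2 and 3 can be sketched as follows. The stop-word list, the small table of word transformations, and the lowercasing are assumptions made for this illustration; real systems use larger lists and stemming or lemmatization tools.

```python
import re

# Hypothetical list of words considered not useful for indexing.
STOP_WORDS = {"the", "of", "is", "in"}

# Hypothetical lookup table for the word transformations shown above.
NORMAL_FORMS = {"friends": "friend", "wolves": "wolf", "eaten": "eat"}

def tokenize(text):
    """Step 2: split the text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def preprocess(tokens):
    """Step 3: drop stop words and map words to a normal form."""
    terms = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        terms.append(NORMAL_FORMS.get(token, token))
    return terms

tokens = tokenize("The city of Shenzhen is located in China")
print(tokens)
# ['the', 'city', 'of', 'shenzhen', 'is', 'located', 'in', 'china']
print(preprocess(tokens))
# ['city', 'shenzhen', 'located', 'china']
```

The sentence yields eight tokens, and four terms remain after preprocessing.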

How to create an index? (cont’d)
Step 4: create the dictionary
City | Shenzhen | Located | China
Dictionary (term: postings):
City: Book1, Book2, Book 20, Book 7…
Shenzhen: Book1, Book3, Book 5, Book 9…
Located: …
China: Book1, Book 20, Book 34…

The index has been created! It can then be used to search documents.
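A minimal sketch of step 4, assuming the documents have already been tokenized and preprocessed. The three documents, their integer IDs (standing for Book 1, Book 2, Book 3), and the function name `build_index` are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: document ID -> terms left after preprocessing.
documents = {
    1: ["city", "shenzhen", "located", "china"],
    2: ["city", "beijing"],
    3: ["shenzhen", "harbour"],
}

def build_index(documents):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so that a term repeated in a document yields a single posting
        for term in set(documents[doc_id]):
            index[term].append(doc_id)
    return dict(index)

index = build_index(documents)
print(index["city"])      # [1, 2]
print(index["shenzhen"])  # [1, 3]
print(index["china"])     # [1]
```

Because documents are visited in increasing ID order, every postings list comes out sorted, which makes later merging of lists easy.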

Document ID (标识符)
An IR system will generally assign a unique identifier (ID) to each document.
e.g. Book 1, Book 2, Book 3…

Dictionary (cont’d)
A dictionary is typically sorted to allow faster search (e.g. in alphabetical order - 按字母顺序).
Before sorting:
City: Book1, Book2, Book 20, Book 7…
Shenzhen: Book1, Book3, Book 5, Book 9…
Located: …
China: Book1, Book 20, Book 34…
After sorting:
China: Book1, Book 20, Book 34…
City: Book1, Book2, Book 7, Book 20…
Located: …
Shenzhen: Book1, Book3, Book 5, Book 9…
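One reason sorting helps: a sorted dictionary can be searched with binary search instead of checking every term. The sketch below uses Python's standard `bisect` module. The postings are shortened versions of the example lists, and the postings of « located » are invented, since they are not given above.

```python
from bisect import bisect_left

# Terms in alphabetical order, with postings (document IDs) in the same order.
terms = ["china", "city", "located", "shenzhen"]
postings = [
    [1, 20, 34],    # china
    [1, 2, 7, 20],  # city
    [1],            # located (hypothetical)
    [1, 3, 5, 9],   # shenzhen
]

def lookup(term):
    """Binary search for a term; return its postings, or [] if it is absent."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

print(lookup("city"))     # [1, 2, 7, 20]
print(lookup("beijing"))  # []
```

With M terms, binary search needs about log2(M) comparisons, so around 19 for 500,000 terms.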

A dictionary may be used to store statistics about documents
For example: the number of documents that contain each term (word)
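For instance, the number of documents containing a term is simply the length of its postings list. A small sketch, using shortened versions of the example postings (the variable name `document_frequency` is chosen here):

```python
# Shortened example postings: term -> IDs of the documents containing it.
index = {
    "china": [1, 20, 34],
    "city": [1, 2, 7, 20],
    "shenzhen": [1, 3, 5, 9],
}

# Number of documents that contain each term.
document_frequency = {term: len(doc_ids) for term, doc_ids in index.items()}

print(document_frequency)  # {'china': 3, 'city': 4, 'shenzhen': 4}
```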

How does an IR system answer Boolean queries?
Dictionary (term: postings):
China: Book1, Book 20, Book 34…
City: Book1, Book2, Book 7, Book 20…
Located: …
Shenzhen: Book1, Book3, Book 5, Book 9…
The query is: CITY AND CHINA
1) Locate CITY in the dictionary
2) Retrieve its postings
3) Locate CHINA in the dictionary
4) Retrieve its postings
5) Do the intersection (交集) of the two lists
RESULT: Book 1, Book 20
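Step 5 can be done in one pass when both postings lists are sorted by document ID. The sketch below shows one common way to do it (a two-pointer merge); the steps above do not say which method is used, and the function name `intersect` is chosen here. The lists are the visible parts of the postings of CITY and CHINA.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: it matches the AND query.
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance in the list with the smaller ID
        else:
            j += 1
    return result

city = [1, 2, 7, 20]
china = [1, 20, 34]
print(intersect(city, china))  # [1, 20]
```

Each list is read at most once, so the cost is proportional to the total length of the two lists.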

Answering more complex queries
It is also possible to answer more complex queries using a similar approach:
(Shenzhen OR CHINA) AND BEIJING
(Shenzhen AND CHINA) AND BEIJING
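A sketch of how these two queries could be evaluated with Python sets, where OR is set union (`|`) and AND is set intersection (`&`). The postings of Shenzhen and China are shortened versions of the example lists; the postings of Beijing are hypothetical.

```python
# Postings as sets of document IDs.
index = {
    "shenzhen": {1, 3, 5, 9},
    "china": {1, 20, 34},
    "beijing": {1, 5, 20},  # hypothetical
}

# (Shenzhen OR CHINA) AND BEIJING
q1 = (index["shenzhen"] | index["china"]) & index["beijing"]
print(sorted(q1))  # [1, 5, 20]

# (Shenzhen AND CHINA) AND BEIJING
q2 = (index["shenzhen"] & index["china"]) & index["beijing"]
print(sorted(q2))  # [1]
```

The parentheses of the query become the order in which the set operations are applied.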

Conclusion
I presented the « Boolean retrieval model », where we search documents using terms and Boolean operators (e.g. AND, OR, NOT).
This model is popular, e.g. it is often used to search for books in libraries.
Why use this model? A document either matches a query or does not. Thus, we can precisely search for documents.
We will see extensions of this model with more features…

Conclusion Before next week, please make sure that you can access the course website

Exercises

An exercise
This is an exercise that you can do at home if you want to review what we have learnt in Chapter 1 of the book.
b. Draw the dictionary (also called inverted index representation) for this collection.
c. What are the returned results for these queries?
- schizophrenia AND drug
- for AND NOT (drug OR approach)

References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press, 2008
