1 Mining the Web, Ch 3: Web Search and Information Retrieval (Artificial Intelligence Lab, 박대원)

2 Contents
 - What is IR?
 - Queries & Inverted Index
 - Relevance Ranking
 - Similarity Search

3 What is IR?
 - IR (Information Retrieval)
   - Prepare a keyword index for a given corpus
   - Respond to keyword queries with a ranked list of documents
 - Web search engine
   - Built on an IR system
   - Given corpus: the Web

4 Queries & Inverted Index
 - Query: a sequence of terms
   - List terms that characterize the documents you want to find
 - Boolean query (the typical query)
   - An expression built from terms and Boolean operators
   - Examples: "Java" OR "API"; "Java" AND "island"; "Java" NOT "coffee"
 - Proximity query
   - A query that uses term-position information
   - Examples: the phrase "java beans"; "java" and "island" in the same sentence
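The Boolean operators above map directly onto set operations over the sets of document IDs containing each term. A minimal sketch (toy corpus and names are illustrative, not from the book):

```python
# Toy corpus: document ID -> text
docs = {
    1: "java is an island of indonesia",
    2: "java api and java beans",
    3: "coffee from the island of java",
}

def doc_ids(term):
    """IDs of the documents whose text contains the term."""
    return {d for d, text in docs.items() if term in text.split()}

# Each Boolean operator becomes a set operation on posting sets:
and_result = doc_ids("java") & doc_ids("island")   # AND -> intersection
not_result = doc_ids("java") - doc_ids("coffee")   # NOT -> difference
or_result = doc_ids("java") | doc_ids("api")       # OR  -> union
```

A real engine evaluates the same operations over inverted-index posting lists instead of scanning every document.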

5 Queries & Inverted Index
 - The 'document-term' relation: document-centric
   - A document consists of terms
   - Assumption: 'the terms in a document represent its content'
   - Then how do we find the documents we want?

6 Queries & Inverted Index
 - Inverted index
   - The 'term-document' relation: term-centric
   - Posting file: may also record term positions
   - Indexing requires extracting the terms from each document
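A minimal sketch of such a positional inverted index (toy corpus; the helper name `phrase_docs` is illustrative). Keeping positions in the posting file is what makes the phrase and proximity queries of the previous slide answerable:

```python
from collections import defaultdict

# Toy corpus: document ID -> text
docs = {
    1: "java is an island",
    2: "java beans and java api",
}

# Inverted index: term -> {doc_id: [positions]}
index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term][doc_id].append(pos)

def phrase_docs(t1, t2):
    """Documents where t2 occurs immediately after t1 (a phrase query)."""
    return {d for d, positions in index[t1].items()
            if any(p + 1 in index[t2].get(d, []) for p in positions)}
```

For example, `index["java"]` holds positions 0 in document 1 and positions 0 and 3 in document 2, so the phrase "java beans" matches only document 2.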

7 Queries & Inverted Index
 - Indexing: stopwords & stemming
   - Stopwords
     - Examples: a, an, the, of, with (English); particles and endings (Korean)
     - Removing stopwords reduces index space
     - But it may reduce recall in phrase search, e.g. "to be or not to be"
   - Stemming
     - Match a query term with its morphological variants
     - Examples: gains, gaining -> gain; went, goes -> go

8 Queries & Inverted Index
 - Indexing
   - Batch indexing and update
     - The index changes as the corpus changes
     - Indexing/updating uses two indices
   - Index compression
     - Apply data compression methods to the gaps in posting lists
     - Examples: gamma code, delta code, Golomb code

   Gap x | Unary      | Gamma    | Delta     | Golomb b=3 | Golomb b=6
   ------+------------+----------+-----------+------------+-----------
   1     | 0          | 0        | 0         | 0 0        | 0 00
   2     | 10         | 10 0     | 100 0     | 0 10       | 0 01
   3     | 110        | 10 1     | 100 1     | 0 11       | 0 100
   4     | 1110       | 110 00   | 101 00    | 10 0       | 0 101
   5     | 11110      | 110 01   | 101 01    | 10 10      | 0 110
   6     | 111110     | 110 10   | 101 10    | 10 11      | 0 111
   7     | 1111110    | 110 11   | 101 11    | 110 0      | 10 00
   8     | 11111110   | 1110 000 | 11000 000 | 110 10     | 10 01
   9     | 111111110  | 1110 001 | 11000 001 | 110 11     | 10 100
   10    | 1111111110 | 1110 010 | 11000 010 | 1110 0     | 10 101
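The codewords in the table can be generated programmatically. Below is a minimal Python sketch (illustrative, not from the book) of the unary, Elias gamma, Elias delta, and Golomb coders; the spaces inside the table's codewords merely separate the prefix from the suffix:

```python
import math

def unary(n):
    """Unary code for n >= 1: (n-1) one-bits followed by a zero."""
    return "1" * (n - 1) + "0"

def gamma(x):
    """Elias gamma: unary code of the bit length, then the binary offset."""
    return unary(x.bit_length()) + bin(x)[3:]

def delta(x):
    """Elias delta: gamma-code the bit length, then the binary offset."""
    return gamma(x.bit_length()) + bin(x)[3:]

def golomb(x, b):
    """Golomb code for x >= 1: unary quotient, truncated-binary remainder."""
    q, r = divmod(x - 1, b)
    k = math.ceil(math.log2(b))
    cutoff = 2 ** k - b            # short remainders get k-1 bits
    if r < cutoff:
        rem = format(r, "b").zfill(k - 1) if k > 1 else ""
    else:
        rem = format(r + cutoff, "b").zfill(k)
    return unary(q + 1) + rem
```

For example, `gamma(10)` yields `1110` + `010`, `delta(10)` yields `11000` + `010`, and `golomb(10, 6)` yields `10` + `101`, matching the table's last row.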

9 Relevance Ranking
 - Evaluation of IR systems
   - Recall: the fraction of relevant documents that are retrieved
   - Precision: the fraction of retrieved documents that are relevant
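The two measures differ only in the denominator. A tiny worked example with made-up document-ID sets (the numbers are purely illustrative):

```python
# Hypothetical query result: the engine returned 4 documents,
# while 6 documents in the corpus are actually relevant.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7, 8}

hits = retrieved & relevant                 # relevant AND retrieved: {2, 4}
recall = len(hits) / len(relevant)          # 2/6: share of relevant docs found
precision = len(hits) / len(retrieved)      # 2/4: share of results that are relevant
```

Returning more documents tends to raise recall but lower precision, which is why ranked retrieval reports both.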

10 Relevance Ranking
 - Vector-space model
   [Diagram: documents D1 and D2 are mapped to vectors V1 and V2 capturing the essence/meaning of each document; a query Q1 is likewise a vector, and retrieval finds the document maximizing Sim(Vi, Q1).]

11 Relevance Ranking
 - Vector space model
   - Documents are represented as vectors
   - Term weight: tf*idf
     - tf: term frequency
     - idf: inverse document frequency
   - Cosine measure: Sim(D,Q) = (D . Q) / (||D|| ||Q||)
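The pieces above fit together as follows: weight each term by tf*idf, then rank documents by the cosine of the angle between document and query vectors. A minimal sketch over a toy corpus (corpus and the idf variant log(N/df) are illustrative; real systems use smoothed variants):

```python
import math
from collections import Counter

# Toy corpus (illustrative)
docs = {
    "d1": "web search engine",
    "d2": "web mining and web search",
    "d3": "coffee from java",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}      # term frequencies
df = Counter(t for counts in tf.values() for t in set(counts))   # document frequencies

def tfidf(counts):
    """tf*idf weight vector for a bag of terms (unseen terms are dropped)."""
    return {t: c * math.log(N / df[t]) for t, c in counts.items() if t in df}

def cosine(u, v):
    """Cosine measure: Sim(u, v) = (u . v) / (||u|| ||v||)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tfidf(Counter("web search".split()))
ranked = sorted(docs, key=lambda d: cosine(query, tfidf(tf[d])), reverse=True)
```

Documents sharing no terms with the query score exactly zero, so "coffee from java" lands at the bottom of the ranking for the query "web search".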

12 Relevance Ranking
 - Relevance feedback
   - The average web query is only about two words long: too few terms
   - Users modify queries by adding or negating additional keywords
   - Relevance feedback: a query-refinement process
   - Rocchio's method: Q' = alpha*Q + (beta/|D+|) * sum_{d in D+} d - (gamma/|D-|) * sum_{d in D-} d
     - D+: relevant documents, D-: irrelevant documents
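A minimal sketch of the Rocchio update over sparse term-weight dictionaries (the default parameter values alpha=1.0, beta=0.75, gamma=0.15 are common textbook choices, not prescribed by this chapter; clipping negative weights to zero is likewise a common convention):

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Q' = alpha*Q + (beta/|D+|)*sum(D+) - (gamma/|D-|)*sum(D-)."""
    terms = set(query)
    terms |= {t for d in relevant for t in d}
    terms |= {t for d in irrelevant for t in d}
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if irrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in irrelevant) / len(irrelevant)
        new_query[t] = max(w, 0.0)   # clip negative weights to zero
    return new_query

refined = rocchio(query={"java": 1.0},
                  relevant=[{"java": 1.0, "island": 1.0}],
                  irrelevant=[{"coffee": 1.0}])
```

The refined query gains weight on "island" (pulled in from the relevant document) and pushes "coffee" to zero, illustrating how feedback both adds and negates keywords automatically.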

13 Relevance Ranking
 - Probabilistic relevance feedback models
   - Probabilistic models estimate the relevance of documents
   - Odds ratio for relevance: requires too much user effort
   - Bayesian inference network (Chapter 5)
     - A directed acyclic graph with document, representation, and concept layers of nodes
     - Requires manual mapping of terms to concepts

14 Relevance Ranking
 - Advanced issues (issues a hypertext search engine must handle)
   - Spamming
     - Spam terms go unnoticed by human readers but are seen by search engines (hidden via font color, position, repetition, ...)
     - Engines must eliminate such spam terms; hyperlink-based ranking techniques also help
   - Titles, headings, metatags, and anchor text
     - Plain term-based ranking makes no distinction for titles, headings, metatags, or anchors
     - Exploit the structured information within Web pages
     - Exploit anchor text

15 Relevance Ranking
 - Advanced issues (issues a hypertext search engine must handle)
   - Ranking for complex queries, including phrases
     - Use a phrase dictionary
     - Use term positions within documents (sentences)
   - Approximate string matching
     - Retrieve partially matching words
     - Use n-grams
   - Meta-search systems
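One way to realize n-gram-based approximate matching, sketched below, is to compare the sets of character n-grams of two words (the boundary marker `#` and the trigram width are illustrative choices). A misspelling shares many trigrams with the intended word, while an unrelated word shares almost none:

```python
def char_ngrams(word, n=3):
    """Set of character n-grams, with '#' marking the word boundaries."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_sim(a, b, n=3):
    """Jaccard-style overlap of the two words' character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)
```

For example, "retrieval" and the typo "retreival" still overlap on trigrams like `#re`, `ret`, and `al#`, so they score well above an unrelated word such as "coffee".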

16 Similarity Search
 - The Web data problem
   - Page replication, site mirroring, archived data, etc.
 - Handling "find-similar" queries
   - Given a "query" document dq, find a small number of documents d from the corpus D having the largest value of dq . d
   - Similarity measure: the Jaccard coefficient

17 Similarity Search
 - Eliminating near-duplicates via shingling
   - Comparing checksums of entire pages
     - Maintain a checksum for every page in the corpus
     - Detects replicated documents, but depends on exact checksum equality
   - Measuring the dissimilarity between pages with edit distance
     - Too time-consuming over all pairs of documents; impractical
   - q-grams, or shingles
     - A shingle is a contiguous subsequence of tokens taken from a document
     - S(d,w): the set of distinct shingles of width w in document d
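Shingle sets plug directly into the Jaccard coefficient from the previous slide: two near-duplicate pages share most of their shingles even when checksums differ. A minimal sketch (whitespace tokenization and width w=3 are illustrative choices):

```python
def shingles(text, w=3):
    """S(d, w): the set of distinct shingles of width w in document text."""
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def jaccard(s1, s2):
    """Jaccard coefficient |S1 n S2| / |S1 u S2| of two shingle sets."""
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 1.0

# Two pages differing in a single word: exact checksums would miss this,
# but the shingle sets still overlap heavily.
a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
```

Here the one-word change touches only the three shingles containing it, so `jaccard(a, b)` stays well above the score of unrelated pages. Real systems additionally sample or hash the shingle sets to avoid storing them whole.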

18 Similarity Search
 - Detecting locally similar subgraphs of the Web (Chapter 7)
   - Collapsing locally similar Web subgraphs can improve hyperlink-assisted ranking
   - Approaches to detecting mirrored sites
     - Approach 1
       - Suspected duplicates are reduced to a sequence of outlinks, with all href strings converted to a canonical form
       - The cleaned URLs are assigned unique token IDs, then listed and sorted to find duplicates or near-duplicates

19 Similarity Search
 - Detecting locally similar subgraphs of the Web (Chapter 7)
   - Approaches to detecting mirrored sites
     - Approach 2: use regularity within URL strings to identify host pairs
       - Convert host and path to lowercase
       - Treat any punctuation or digit sequence as a token separator
       - Tokenize the URL into a sequence of tokens, e.g. www5.infoseek.com -> www, infoseek, com
       - Eliminate stop terms such as htm, html, txt, cgi, main, index, home
       - Form positional bigrams from the token sequence, e.g. '/cell-block16/inmates/dilbert/personal/foo.htm' -> (cellblock,inmates,0), (inmates,dilbert,1), (dilbert,personal,2), (personal,foo,3)
     - Then apply a "find-similar" algorithm
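The steps above can be sketched as follows. To reproduce the slide's own example ('cell-block16' becoming the single token 'cellblock', and 'foo.htm' losing its 'htm' suffix), this sketch assumes each path segment is split on punctuation/digits, stop terms are dropped, and the remaining pieces of a segment are rejoined; that reading of the slide is an assumption, as is every helper name here:

```python
import re

STOP_TERMS = {"htm", "html", "txt", "cgi", "main", "index", "home"}

def url_tokens(path):
    """Lowercase the path, split it into segments, break each segment on
    punctuation/digit sequences, drop stop terms, and rejoin each segment."""
    tokens = []
    for segment in path.lower().split("/"):
        pieces = [p for p in re.split(r"[^a-z]+", segment)
                  if p and p not in STOP_TERMS]
        if pieces:
            tokens.append("".join(pieces))
    return tokens

def positional_bigrams(tokens):
    """(token_i, token_i+1, i) triples over the token sequence."""
    return [(tokens[i], tokens[i + 1], i) for i in range(len(tokens) - 1)]

bigrams = positional_bigrams(url_tokens("/cell-block16/inmates/dilbert/personal/foo.htm"))
```

Two URL paths from mirrored sites then yield overlapping positional-bigram sets, which the "find-similar" machinery can compare just like shingle sets.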

