Chapter 4: Query Languages. Student: 曾寶樂. Student ID: 88522070. Instructor: 張嘉惠. Presentation date: 89/10/26.


1 Chapter 4: Query Languages
Student: 曾寶樂. Student ID: 88522070. Instructor: 張嘉惠. Presentation date: 89/10/26

2 Outline
- Keyword-Based Querying
- Pattern Matching
- Structural Queries
- Query Protocols
- Trends and Research Issues

3 Keyword-Based Querying
- A query is the formulation of a user's information need
- Keyword-based queries are popular
  1. Single-Word Queries
  2. Context Queries
  3. Boolean Queries
  4. Natural Language

4 Single-Word Queries
- A query is formulated as a single word
- A document is modeled as a long sequence of words
- A word is a sequence of letters surrounded by separators
- What counts as a letter and what counts as a separator? e.g., 'on-line'
- The division of the text into words is not arbitrary
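The question of what counts as a separator can be made concrete with a small tokenizer. This is only a sketch: treating `\w` characters (letters, digits, underscore) as "letters" and the `keep_hyphen` switch are assumptions for illustration, not something the slide defines.

```python
import re

def tokenize(text, keep_hyphen=False):
    """Split text into lowercase words.

    Assumption: `\\w` characters are letters and everything else is a separator.
    keep_hyphen is a hypothetical option that treats '-' inside a word as a letter.
    """
    pattern = r"\w+(?:-\w+)*" if keep_hyphen else r"\w+"
    return [w.lower() for w in re.findall(pattern, text)]

# The same text yields different vocabularies depending on the choice.
print(tokenize("The on-line system"))
print(tokenize("The on-line system", keep_hyphen=True))
```

With the hyphen as a separator, a query for 'line' matches this text; with the hyphen as a letter, only 'on-line' does.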

5 Context Queries
Definition
- Search for words in a given context, e.g., near other words
Types
- Phrase
  > a sequence of single-word queries
  > e.g., 'enhance retrieval'
- Proximity
  > a sequence of single words or phrases is given, together with a maximum allowed distance between them
  > e.g., within distance (enhance, retrieval, 4) will match '… enhance the power of retrieval …'
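A minimal sketch of both context-query types over a tokenized text. It assumes distance is counted in word positions and that the second word must follow the first; both are assumptions, since distance could equally be measured in characters and order could be ignored.

```python
def positions(tokens, word):
    """Return every index at which `word` occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]

def phrase_match(tokens, phrase):
    """True if the words of `phrase` occur consecutively in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def within_distance(tokens, first, second, max_dist):
    """True if `second` occurs at most `max_dist` word positions after `first`."""
    return any(
        0 < j - i <= max_dist
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )

tokens = "enhance the power of retrieval".split()
print(within_distance(tokens, "enhance", "retrieval", 4))
print(phrase_match(tokens, ["enhance", "retrieval"]))
```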

6 Boolean Queries
Definition
- A syntax composed of atoms that retrieve documents, and of Boolean operators which work on their operands
- e.g., translation AND syntax OR syntactic

7 Boolean Queries
Operators
- (e1 OR e2) selects all documents which satisfy e1 or e2
- (e1 AND e2) selects all documents which satisfy both e1 and e2
- (e1 BUT e2) selects all documents which satisfy e1 but not e2
"Fuzzy Boolean"
- Retrieves documents appearing in some operands (AND may require a document to appear in more operands than OR)
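The three operators map directly onto set operations over the document sets retrieved by each atom. The sketch below uses a made-up inverted index (the document numbers are hypothetical) and assumes the example query is grouped as translation AND (syntax OR syntactic).

```python
# Hypothetical inverted index: term -> set of document ids containing it.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}

def docs(term):
    """Atom: the set of documents that contain `term`."""
    return index.get(term, set())

# (e1 AND e2) is intersection, (e1 OR e2) is union, (e1 BUT e2) is difference.
and_or_result = docs("translation") & (docs("syntax") | docs("syntactic"))
but_result = docs("translation") - docs("syntax")

print(sorted(and_or_result))
print(sorted(but_result))
```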

8 Natural Language
- A generalization of "fuzzy Boolean"
- A query is an enumeration of words and context queries
- All the documents matching a portion of the user query are retrieved
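One simple way to read "matching a portion of the query" is to retrieve every document that contains at least one query word and to order the results by how many distinct query words they contain. This is a sketch under that assumption; the scoring rule and the sample documents are illustrative, not taken from the slides.

```python
def rank(query_words, documents):
    """Return ids of documents matching at least one query word, best match first."""
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # Score = number of distinct query words present in the document.
        score = sum(1 for w in set(query_words) if w in words)
        if score > 0:
            scored.append((score, doc_id))
    # Higher score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

documents = {
    "d1": "query languages for text retrieval",
    "d2": "text compression",
    "d3": "cooking recipes",
}
print(rank(["text", "retrieval", "query"], documents))
```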

9 Pattern Matching
A pattern is a set of syntactic features that must occur in a text segment
Types
- Words
- Prefixes, e.g., 'comput' -> 'computer', 'computation', 'computing', etc.
- Suffixes, e.g., 'ters' -> 'computers', 'testers', 'painters', etc.
- Substrings, e.g., 'tal' -> 'coastal', 'talk', 'metallic', etc.
- Ranges, e.g., the range between 'held' and 'hold' -> 'hoax' and 'hissing'
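The pattern types listed above can be sketched as filters over a vocabulary. The word list is assembled from the slide's own examples plus 'house' as a non-match; treating the range bounds as inclusive is an assumption.

```python
vocab = sorted([
    "computer", "computation", "computing", "computers", "testers",
    "painters", "coastal", "talk", "metallic", "held", "hissing",
    "hoax", "hold", "house",
])

def by_prefix(prefix):
    return [w for w in vocab if w.startswith(prefix)]

def by_suffix(suffix):
    return [w for w in vocab if w.endswith(suffix)]

def by_substring(sub):
    return [w for w in vocab if sub in w]

def by_range(low, high):
    # Lexicographical range; both bounds are included (an assumption).
    return [w for w in vocab if low <= w <= high]

print(by_prefix("comput"))
print(by_suffix("ters"))
print(by_substring("tal"))
print(by_range("held", "hold"))
```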

10 Pattern Matching
Allowing errors
- Retrieve all text words which are 'similar' to the given word
- Edit distance: the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., 'flower' and 'flo wer' are at distance 1
- Maximum allowed edit distance: the query specifies the maximum number of allowed errors for a word to match the pattern
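The edit distance defined above can be computed with the standard dynamic-programming recurrence. The sketch below keeps only one row of the table at a time; the function names are illustrative.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # replace ca by cb, or keep it if equal
            ))
        prev = cur
    return prev[-1]

def matches(word, pattern, max_errors):
    """True if `word` is within the maximum allowed edit distance of `pattern`."""
    return edit_distance(word, pattern) <= max_errors

print(edit_distance("flower", "flo wer"))
print(matches("flo wer", "flower", 1))
```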

11 Pattern Matching
Regular expressions
- Union: if e1 and e2 are regular expressions, then (e1|e2) matches what e1 or e2 matches
- Concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 immediately followed by those of e2
- Repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e
- e.g., 'pro(blem|tein)(s|ε)(0|1|2)*' matches 'problem2' and 'proteins' (ε denotes the empty string)
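The slide's example translates almost directly into Python's `re` syntax. The one adjustment, noted in the code, is that Python has no ε symbol, so an empty alternative stands in for it.

```python
import re

# `(s|)` is Python's way of writing (s|ε): either 's' or the empty string.
PATTERN = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

def matches(word):
    """True if the whole word is an occurrence of the pattern."""
    return PATTERN.fullmatch(word) is not None

for word in ["problem2", "proteins", "prophet"]:
    print(word, matches(word))
```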

12 Structural Queries
Mixing contents and structure in queries
- Contents: words, phrases, or patterns
- Structural constraints: containment, proximity, or other restrictions on structural elements
Three main structures
- Fixed structure
- Hypertext structure
- Hierarchical structure

13 Fixed Structure
- A document is a fixed set of fields
- Example: a mail has a sender, a receiver, a date, a subject and a body field
- Query example: search for the mails sent to a given person with "football" in the Subject field
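The mail query can be sketched as a filter over records with a fixed set of fields. The field names follow the slide; the sample mails and addresses are invented for illustration.

```python
# Hypothetical mail collection; every document has the same five fields.
mails = [
    {"sender": "bob@example.org", "receiver": "alice@example.org",
     "date": "2000-10-20", "subject": "Football on Saturday", "body": "Are you in?"},
    {"sender": "carol@example.org", "receiver": "alice@example.org",
     "date": "2000-10-21", "subject": "Meeting notes", "body": "We discussed football."},
    {"sender": "alice@example.org", "receiver": "bob@example.org",
     "date": "2000-10-22", "subject": "Re: football", "body": "Yes."},
]

def search(mails, receiver, subject_word):
    """Mails sent to `receiver` whose Subject field contains `subject_word`."""
    word = subject_word.lower()
    return [
        m for m in mails
        if m["receiver"] == receiver and word in m["subject"].lower().split()
    ]

for mail in search(mails, "alice@example.org", "football"):
    print(mail["sender"], mail["subject"])
```

The second mail mentions football only in its body, so restricting the word to the Subject field excludes it.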

14 Hypertext
A hypertext is a directed graph where
- the nodes hold some text (text contents)
- the links represent connections between nodes or between positions inside nodes (structural connectivity)
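A hypertext can be sketched as a dictionary of node texts plus an adjacency list of links. The search below combines content and connectivity by looking for a word only in nodes reachable from a start node within a few links. It illustrates the idea of mixing browsing and searching; it is not taken from any particular system, and the node names are hypothetical.

```python
from collections import deque

# Hypothetical hypertext: node -> text, and node -> list of linked nodes.
nodes = {
    "a": "home page",
    "b": "query languages",
    "c": "pattern matching query",
    "d": "query protocols",
}
links = {"a": ["b", "c"], "b": ["d"], "c": ["a"]}

def search_neighborhood(nodes, links, start, word, max_hops):
    """Nodes containing `word` that are at most `max_hops` links away from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, dist = queue.popleft()
        if word in nodes[node].lower().split():
            hits.append(node)
        if dist < max_hops:
            for nxt in links.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return hits

print(search_neighborhood(nodes, links, "a", "query", 1))
```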

15 Hypertext: WebGlimpse
WebGlimpse combines browsing and searching on the Web

16 Hierarchical Structure Recursive decomposition of the text
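Recursive decomposition can be sketched as a tree in which each node has a structural type, some text of its own, and child nodes. The query below finds structural elements of a given type whose contents, including everything nested inside them, contain a word. The `Node` class and the sample book are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # structural type, e.g. "chapter" or "section"
    text: str = ""                 # text held directly by this node
    children: list = field(default_factory=list)

def full_text(node):
    """Text of the node together with the text of all its descendants."""
    return " ".join([node.text] + [full_text(c) for c in node.children])

def find(node, kind, word):
    """All nodes of type `kind` whose full text contains `word`."""
    hits = []
    if node.kind == kind and word in full_text(node).lower().split():
        hits.append(node)
    for child in node.children:
        hits.extend(find(child, kind, word))
    return hits

book = Node("book", children=[
    Node("chapter", "Query Languages", [
        Node("section", "Pattern matching"),
        Node("section", "Structural queries"),
    ]),
    Node("chapter", "Indexing", [Node("section", "Inverted files")]),
])
print([n.text for n in find(book, "chapter", "structural")])
```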

17 Hierarchical Structure


20
- PAT Expressions
- Overlapped Lists
- Lists of References
- Proximal Nodes
- Tree Matching

21 PAT Expressions
- What is a PAT tree?
- The areas of a region cannot nest or overlap

22 PAT Tree

23 Overlapped Lists
- The model allows the areas of a region to overlap, but not to nest
- It is not clear whether overlapping is good or not for capturing the structural properties

24 Lists of References
- Overlapping and nesting are not allowed
- All elements must be of the same type, e.g., only sections, or only paragraphs
- A reference is a pointer to a region of the database

25 Proximal Nodes This model tries to find a good compromise between expressiveness and efficiency. It does not define a specific language, but a model in which it is shown that a number of useful operators can be included achieving good efficiency.

26 Tree Matching The leaves of the query can be not only structural elements but also text patterns, meaning that the ancestor of the leaf must contain that pattern.

27 Query Protocols
- Z39.50
- WAIS (Wide Area Information Service)

28 Z39.50
- American National Standard Information Retrieval Application Service Definition
- Can be implemented on any platform
- Queries bibliographical information using a standard interface between the client and the host database manager
- The Z39.50 protocol is part of WAIS

29 Z39.50
Brief history
- Z39.50-1988 (version 1)
- Z39.50-1992 (version 2)
- Z39.50-1995 (version 3)
- Version 4: development began in autumn 1995

30 Using Z39.50 over the WWW
(Diagram; surviving labels: WWW Client, WWW, Z39.50, Z39.50 Client, Z39.50 Server, Repository, Digital library)

31 WAIS (Wide Area Information Service)
- Beginning in the 1990s
- Queries databases through the Internet

32 Trends and Research Issues
Relationship between types of queries and models
- Model: Boolean, Vector, Probabilistic, BBN
- Queries allowed: word, set operations; words

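33 Boolean Model
- Boolean expressions have precise semantics, but how to express a document as a Boolean expression is itself a problem.
- The model relies on binary comparison and lacks any comparison by "similarity" or "degree", so it cannot support queries for similar documents.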

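34 Vector Model
Advantages:
(1) term weighting improves retrieval performance;
(2) it allows related documents to be retrieved;
(3) it can compute the degree of similarity between documents, so the most similar documents can be found.
Disadvantages: the model assumes that terms are independent. If a keyword appears in every document, its weight becomes 0, which ignores the meaning implied by the different frequencies with which the word occurs in each document.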
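The remark about a weight of 0 follows from inverse document frequency weighting. The slide gives no formula, so the sketch below assumes the common form idf = log(N / n), where N is the number of documents and n the number of documents containing the term; the sample collection is invented.

```python
import math

def idf(term, docs):
    """Inverse document frequency, assuming the common form log(N / n)."""
    n_docs = len(docs)
    n_with_term = sum(1 for d in docs if term in d)
    if n_with_term == 0:
        return 0.0  # term absent from the collection; no weight to assign
    return math.log(n_docs / n_with_term)

# Hypothetical collection: each document is a set of terms.
docs = [
    {"query", "language", "text"},
    {"text", "pattern"},
    {"text", "protocol", "query"},
]
for term in ["text", "query", "pattern"]:
    print(term, round(idf(term, docs), 3))
```

Because 'text' occurs in all three documents, log(3 / 3) = 0, so the term contributes nothing to the similarity score no matter how often it occurs inside each document.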

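35 Probabilistic Model
Its main advantage is that it can compute a probability value for similarity, but it has several drawbacks:
(1) the sets of relevant and non-relevant documents have to be guessed;
(2) the frequency with which terms occur in a document is not considered;
(3) index terms have to be assumed mutually independent.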

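36 Bayesian Belief Network
- A directed acyclic graph made up of a qualitative part and a quantitative part. The qualitative part is a directed graph formed by the domain variables and the interactions among them; the quantitative part is the joint probability distribution of those domain variables.
- In the directed graph, each node represents a random variable and each link indicates an interaction between two variables. In short, the directed graph is a factored representation of the joint probability distribution of these variables.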

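37 Bayesian Belief Network
Pregnancy (P) causes a change in hormones (H) (i.e., it affects the hormonal state) and a change in the scan shadow (S); the change in hormones in turn changes the results of the blood test (B) and the urine test (U).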
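Reading the example as the graph P → H, P → S, H → B, H → U (an interpretation of the slide text, since the figure itself is missing), the factored form of the joint distribution would be written as follows, using Pr for probability to avoid a clash with the variable P:

```latex
\Pr(P, H, S, B, U) = \Pr(P)\,\Pr(H \mid P)\,\Pr(S \mid P)\,\Pr(B \mid H)\,\Pr(U \mid H)
```

Each factor conditions a variable only on its parents in the graph, which is what makes the network a compact representation of the joint distribution.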

38 Trends and Research Issues The types of queries covered and how they are structured

