Midterm Exam Reference Solutions Date: 2005/12/14 Multimedia Information Systems.

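1 Question 1 (20%) Imagine you are hired to examine a text retrieval system in vector model, which does not perform well. After the examination, what suggestions might you make for improving its efficiency, and what suggestions might you make for improving its effectiveness?
– This question has two parts: how to speed up the computation, and how to make the results more accurate. Many answers, however, did not address what was asked ……
– Efficiency: when considering ways to improve speed, accuracy need not be considered, so the following methods apply.
  – Reduce the number of index terms. For example, if the original system has 1000 index terms and this is reduced to 100, building the index becomes faster.
  – Simplify the computation formula.
  – Improve the hardware: use a faster CPU or add memory (with a large amount of data, more memory reduces the number of disk accesses).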

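2 Question 1
– Effectiveness: when considering ways to improve accuracy, speed need not be considered, so the following methods apply.
  – User feedback
  – Parameter tuning (as everyone who did the homework should know, some well-tuned parameters affect accuracy)
  – Query expansion
  – Preprocessing (for example, classifying the documents beforehand)
– Grading: this question is worth 20 points, so each part is worth 10 points. Because the question carries considerable weight, each part must list at least two methods. Each method earns 5 points.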

3 Question 2 (20%) Given the following ranked answers generated by an IR system: D2, D34, D62, D27, D236, D72, D5, D12, D23, D7, and the set of relevant documents for the corresponding query: {D3, D5, D27, D29, D34, D52, D78, D103, D152, D236}. (a) Draw a recall-precision curve. (b) Draw a graph of precision at the relevant document cutoff values 1, 2, 3, and 4. (c) What is its R-precision value?
– (c) R is the number of relevant documents for the query, and R-precision is the precision over the top R answers returned by the system. Here there are ten relevant documents, so R = 10. Four of the ten top-ranked answers (D34, D27, D236, D5) are relevant, so the precision over the top ten is 0.4. Hence R-precision = 0.4.
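The calculation in (c) can be checked with a short sketch. The function name `r_precision` is a hypothetical helper introduced here for illustration; the ranking and the relevant set are copied from the question.

```python
# Ranked answers and relevant set copied from the question.
ranking = ["D2", "D34", "D62", "D27", "D236", "D72", "D5", "D12", "D23", "D7"]
relevant = {"D3", "D5", "D27", "D29", "D34", "D52", "D78", "D103", "D152", "D236"}


def r_precision(ranking, relevant):
    """Precision over the top R ranked answers, where R = number of relevant documents."""
    r = len(relevant)
    hits = sum(1 for doc in ranking[:r] if doc in relevant)
    return hits / r


print(r_precision(ranking, relevant))  # 0.4
```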

4 Question 2
– (a) The recall-precision curve is shown in the figure below. The horizontal axis is Recall. Both axes of a recall-precision curve are in percent.
[Figure: recall-precision curve. Horizontal axis: Recall (%); vertical axis: Precision (%).]

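5 Question 2
– (b) The plot of precision at the cutoff values is shown in the figure below. A cutoff value gives the precision computed at the point where the system has retrieved the specified number of relevant documents. The numbers specified in this question are 1, 2, 3, and 4, so these are the values on the horizontal axis.
[Figure: precision at relevant-document cutoffs. Horizontal axis: the number of relevant documents that have been seen (1 to 4); vertical axis: Precision (%).]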
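The points plotted in (a) and (b) can be derived from the ranking with a small sketch. It assumes uninterpolated precision, measured each time a relevant document is seen; the helper name `precision_at_relevant` is hypothetical, and the slides' figures may have been drawn differently (for example with interpolation).

```python
# Ranked answers and relevant set copied from the question.
ranking = ["D2", "D34", "D62", "D27", "D236", "D72", "D5", "D12", "D23", "D7"]
relevant = {"D3", "D5", "D27", "D29", "D34", "D52", "D78", "D103", "D152", "D236"}


def precision_at_relevant(ranking, relevant):
    """Return (relevant_seen, recall, precision) each time a relevant document appears."""
    points = []
    seen = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            seen += 1
            points.append((seen, seen / len(relevant), seen / rank))
    return points


for seen, recall, precision in precision_at_relevant(ranking, relevant):
    print(f"{seen} relevant seen: recall {recall:.0%}, precision {precision:.1%}")
```

Under this assumption the relevant documents appear at ranks 2, 4, 5, and 7, giving precisions of 50%, 50%, 60%, and about 57.1% at recall levels 10%, 20%, 30%, and 40%.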

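6 Question 2
– Grading: this question is worth 20 points in total; (a) and (b) are worth 7 points each and (c) is worth 6 points. In (a) and (b), wrong axes cost 3 points and a wrong curve costs 4 points. Part (c) may be answered by explanation or by calculation; essentially, arriving at 0.4 is sufficient.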

7 Question 3 (20%) How to construct an inverted index? How to use it to process a phrase query? Why is it a good index structure?
– (a) An inverted index consists of two parts: the vocabulary and the occurrences. The vocabulary records all index terms, sorted alphabetically. The occurrences record every position at which the corresponding index term appears. In the figure below, text appears at positions 11 and 19, and both positions are recorded in the occurrences.
[Figure: inverted index example]
Text: This is a text. A text has many words. Words are made from letters.
Word start positions: 1, 6, 9, 11, 17, 19, 24, 28, 33, 40, 46, 50, 55, 60
Vocabulary   Occurrences
letters      60, ……
made         50, ……
many         28, ……
text         11, 19, ……
words        33, 40, ……
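The construction in (a) can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the function name `build_inverted_index` and the stop-word list are assumptions, chosen so that the resulting vocabulary matches the figure.

```python
import re
from collections import defaultdict

text = "This is a text. A text has many words. Words are made from letters."


def build_inverted_index(text, stop_words=frozenset()):
    """Map each lowercased word to the 1-based character positions where it starts."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stop_words:
            occurrences[word].append(match.start() + 1)
    # Keep the vocabulary in alphabetical order, as the slide describes.
    return dict(sorted(occurrences.items()))


# Hypothetical stop-word list, chosen so the vocabulary matches the figure.
stop_words = frozenset({"this", "is", "a", "has", "are", "from"})
index = build_inverted_index(text, stop_words)
for term, positions in index.items():
    print(term, positions)
```

With this stop-word list the sketch yields letters at 60, made at 50, many at 28, text at 11 and 19, and words at 33 and 40.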

8 Question 3
– (b) The question asks how to process a phrase query, and a phrase usually consists of several words. We therefore first use the inverted index to look up the positions of every word in the phrase, and then check whether those positions are consistent with the order of the words and the distance between them; if so, we have an answer. For example, in the figure below, if the phrase is many words, the inverted index gives position 28 for many and positions 33 and 40 for words. The distance from the start of many to the start of words is 5, and 28 + 5 = 33, so one answer matches the query: many at 28 and words at 33.
[Figure: the same text and inverted index as on slide 7; many occurs at 28, words at 33 and 40.]
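The position check in (b) can be sketched like this. The helper `phrase_positions` is hypothetical, and it assumes that adjacent words are separated by exactly one character, which is what the 28 + 5 = 33 calculation relies on.

```python
# Occurrence lists taken from the figure (1-based character positions).
index = {"letters": [60], "made": [50], "many": [28], "text": [11, 19], "words": [33, 40]}


def phrase_positions(index, phrase):
    """Return start positions where the words of `phrase` occur consecutively.

    Assumes adjacent words are separated by exactly one character, so the
    next word must start len(previous word) + 1 characters later.
    """
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    results = []
    for start in index[words[0]]:
        pos = start
        for prev, nxt in zip(words, words[1:]):
            pos += len(prev) + 1
            if pos not in index[nxt]:
                break
        else:
            # Every following word was found at its expected position.
            results.append(start)
    return results


print(phrase_positions(index, "many words"))  # [28]
```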

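9 Question 3
– (c) Compared with other indexes, an inverted index has the following advantages:
  – It is easier to implement.
  – It uses little space, roughly 30% to 40% of the text size.
  – When the whole index is too large to fit in memory, its two-part structure lets us keep only the vocabulary in memory and read the needed part of the occurrences only when it is required. This reduces the number of disk accesses.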

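10 Question 3
– Grading: this question is worth 20 points in total; (a) and (b) are worth 7 points each and (c) is worth 6 points.
  – (a) It is enough to state that the index records the positions at which each index term occurs.
  – (b) Note that the question is about phrase queries, so the answer must be able to handle a phrase query. Some answers handle only queries with a single index term, which is incorrect.
  – (c) The question asks why the inverted index is a good index structure. Some answers instead described the benefits of indexes in general, which is a very different thing. Explaining why the inverted index is a good index structure requires comparing it with other indexes. For part (c), one reason is enough as long as the answer is on the right track. The answer must also be consistent with the inverted index structure you chose. For example, some answers used a suffix trie as the data structure and still claimed that it saves memory, which is odd; such answers lose 3 points.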

11 Question 4 (20%) How to process a phrase query using a signature file index? What is the boundary problem when processing such a phrase query? How to deal with the boundary problem?
– (a) Processing a phrase query with a signature file index (shown in the figure below) takes the following steps:
  – Compute the signature of every index term in the phrase query, and OR all of these signatures together to obtain the query signature S_Q.
  – AND S_Q with the text signature of each block. If the result equals S_Q, the block may contain an answer; otherwise it cannot contain an answer to the phrase query.
  – For each block that may contain an answer, fetch the actual data and check whether the phrase really occurs in it.
[Figure: signature file example]
Text (divided into Block 1 to Block 4): This is a text. A text has many words. Words are made from letters.
Text signatures: Block 1 = 000011, Block 2 = 110011, Block 3 = 100100, Block 4 = 101101

12 Question 4
We use the figure to illustrate the steps above:
– Suppose the phrase query is made.. letters. ORing 001100 with 100001 gives the signature of the whole phrase query, 101101.
– ANDing 101101 with the text signature of each block, only block 4 gives a result equal to 101101. The other three blocks therefore cannot contain an answer to the phrase query.
– Checking block 4 itself, we find that made letters occurs in it, so it matches the query.
[Figure: signature file example]
Signature function: h(text) = 000011, h(many) = 110000, h(words) = 100100, h(made) = 001100, h(letters) = 100001
Text (divided into Block 1 to Block 4): This is a text. A text has many words. Words are made from letters.
Text signatures: Block 1 = 000011, Block 2 = 110011, Block 3 = 100100, Block 4 = 101101
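The first two steps of this walkthrough can be sketched with integer bit masks. The signature values are copied from the figure; the function names `query_signature` and `candidate_blocks` are hypothetical. The third step, fetching each candidate block's text to rule out false matches, is not shown.

```python
# Signature function and block signatures copied from the figure.
h = {"text": 0b000011, "many": 0b110000, "words": 0b100100,
     "made": 0b001100, "letters": 0b100001}
block_signatures = [0b000011, 0b110011, 0b100100, 0b101101]  # blocks 1 to 4


def query_signature(words):
    """OR together the signatures of all index terms in the query."""
    sig = 0
    for w in words:
        sig |= h[w]
    return sig


def candidate_blocks(words):
    """1-based numbers of blocks whose signature contains every bit of the query signature."""
    sq = query_signature(words)
    return [i for i, t in enumerate(block_signatures, start=1) if (t & sq) == sq]


print(format(query_signature(["made", "letters"]), "06b"))  # 101101
print(candidate_blocks(["made", "letters"]))                # [4]
```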

13 Question 4
– (b) A phrase query consists of several words, but when the document is divided into blocks, the words of one phrase may be split across two blocks. The search procedure given in (a) then fails to find the answer. This situation is called the boundary problem. Using the figure as an example: suppose the phrase query is many words; ORing the signatures gives 110100. ANDing it with the text signature of every block never yields 110100, so according to (a) many words does not occur. Yet many words does occur, at the end of block 2 and the start of block 3.
[Figure: the same signature function, text, and text signatures as on slide 12.]

14 Question 4
– (c) Solving the boundary problem requires considering the maximum possible length of a phrase query. Suppose this maximum length is K. Then, for each block's text signature T, we OR T with the text signatures of the n blocks that follow it to obtain a new signature T', where n is chosen so that the block together with the n blocks after it has a length greater than or equal to K. From then on, T' is compared against the signature of the phrase query. Using the figure as an example: suppose many words is the phrase query. The length of two blocks already exceeds the length of many words, so when matching each block we OR its text signature with that of the following block. For blocks 2 and 3 this gives T' = 110111, and ANDing the signature of many words, 110100, with T' yields 110100. Fetching the data of blocks 2 and 3 for the final check confirms that many words is indeed present.
[Figure: the same signature function, text, and text signatures as on slide 12.]
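The boundary problem and the fix in (c) can be sketched together. The signature values are copied from the figure; the function name `candidate_windows` and the convention of returning (first block, last block) pairs are assumptions made for this illustration. With n = 0 the sketch behaves like the per-block procedure from (a).

```python
# Signature function and block signatures copied from the figure.
h = {"text": 0b000011, "many": 0b110000, "words": 0b100100,
     "made": 0b001100, "letters": 0b100001}
block_signatures = [0b000011, 0b110011, 0b100100, 0b101101]  # blocks 1 to 4


def candidate_windows(words, n):
    """Return 1-based (first, last) block ranges whose combined signature T' matches.

    T' is the OR of a block's signature with those of the n blocks after it.
    """
    sq = 0
    for w in words:
        sq |= h[w]
    hits = []
    for i in range(len(block_signatures)):
        window = block_signatures[i:i + n + 1]
        t_prime = 0
        for t in window:
            t_prime |= t
        if (t_prime & sq) == sq:
            hits.append((i + 1, i + len(window)))
    return hits


print(candidate_windows(["many", "words"], n=0))  # [] : the boundary problem
print(candidate_windows(["many", "words"], n=1))  # [(2, 3)]
```

As in (a), a matching window is only a candidate; the text of those blocks still has to be fetched and checked.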

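15 Question 4
– Grading: this question is worth 20 points in total; (a) and (b) are worth 7 points each and (c) is worth 6 points.
  – (a) Note that the question is about phrase queries, so the answer must be able to handle a phrase query. Some answers handle only queries with a single index term, which is incorrect. Of the three steps in (a), each one that is wrong costs 2 points; if all are wrong, all 7 points are lost.
  – (c) There are other approaches to this part, and any approach that solves the boundary problem receives credit. For example, some answers used the signature of the first index term in the phrase query to locate where it occurs, then fetched the data that follows and compared it with the other index terms of the phrase (although this is slow ……).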

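16 Question 5 (20%) Compute the edit distance between the strings “xylitol” and “quijote” given the following costs: 1 for both insertion and deletion and 0.5 for replacement.
– The edit distance is the minimum cost of transforming one string into the other. We therefore use the DTW method, a dynamic-programming computation that finds the minimum cost, as shown in the table below. Note that the replacement cost is 0.5. The bottom-right entry, 3.5, is the edit distance.
          x    y    l    i    t    o    l
     0    1    2    3    4    5    6    7
q    1    0.5  1.5  2.5  3.5  4.5  5.5  6.5
u    2    1.5  1    2    3    4    5    6
i    3    2.5  2    1.5  2    3    4    5
j    4    3.5  3    2.5  2    2.5  3.5  4.5
o    5    4.5  4    3.5  3    2.5  2.5  3.5
t    6    5.5  5    4.5  4    3    3    3
e    7    6.5  6    5.5  5    4    3.5  3.5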
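The table can be reproduced with a standard dynamic-programming sketch. The function name `edit_distance` and its parameter names are hypothetical; the costs are the ones given in the question.

```python
def edit_distance(source, target, indel=1.0, replace=0.5):
    """Minimum cost of turning `source` into `target` with the given operation costs."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    # First column and first row: delete or insert every character.
    for i in range(1, rows):
        d[i][0] = i * indel
    for j in range(1, cols):
        d[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            # Matching characters cost nothing; otherwise pay the replacement cost.
            sub = 0.0 if source[i - 1] == target[j - 1] else replace
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match or replace
                          d[i - 1][j] + indel,     # delete from source
                          d[i][j - 1] + indel)     # insert into source
    return d[-1][-1]


print(edit_distance("quijote", "xylitol"))  # 3.5
```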

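17 Question 5
– Grading: the method used must guarantee that the result obtained is the minimum cost. A correct answer alone earns only 5 points, and using DTW earns 5 points. The answer and the use of DTW together account for 10 points; errors in the computation are deducted step by step from the remaining 10 points until none are left.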
