Pat-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. Source: Institute of Information Science, Academia Sinica, Taipei

1 Pat-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval
Source: Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
Students: 陳道輝, 周鉦琪, 葉飛
Advisor: Prof. 黃三益

2 Abstract
PAT-tree-based adaptive approach
IR applications: automatic term suggestion, domain-specific lexicon construction, book indexing, and document classification

3 Introduction
Keyphrase (keyword) extraction in Chinese is a critical problem because word segmentation and unknown-word identification are both difficult (example of an unknown word: 哈電族).

4 Definition of the Problems
Lexical pattern (LP): a string of more than one successive character that occurs a certain number of times in a text collection from a specific domain.
For example: 關鍵詞抽取 ("keyphrase extraction")
LPs: 關鍵, 鍵詞, 詞抽, 抽取, 關鍵詞, 鍵詞抽, 詞抽取, 關鍵詞抽, 鍵詞抽取, 關鍵詞抽取
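
The enumeration on this slide can be reproduced with a minimal sketch that lists every substring of at least two characters. The function name is hypothetical, and the frequency condition in the definition ("certain occurrences in a text collection") is deliberately left out, so the result is only a list of candidates.

```python
def lexical_pattern_candidates(text: str, min_len: int = 2) -> list[str]:
    """Return every contiguous substring of `text` with at least `min_len` characters.

    These are only candidates: the slide's definition also requires a minimum
    number of occurrences in the collection, which is not checked here.
    """
    candidates = []
    for length in range(min_len, len(text) + 1):
        for start in range(len(text) - length + 1):
            candidates.append(text[start:start + length])
    return candidates


# Lists the candidates shortest first, left to right.
for pattern in lexical_pattern_candidates("關鍵詞抽取"):
    print(pattern)
```

A string of n characters yields n(n-1)/2 such candidates, which is why the later slides need an index structure rather than brute-force counting.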

5 Definition of the Problems (cont)
Complete lexical pattern (CLP): an LP that has a complete meaning and proper lexical boundaries.
For example: 關鍵詞抽取
CLPs: 關鍵, 抽取, 關鍵詞, 關鍵詞抽取

6 Definition of the Problems (cont)
Significant lexical pattern (SLP): a CLP that is either "specific" or "significant" in the database.
For example: 關鍵詞抽取
SLPs: 關鍵詞, 關鍵詞抽取

7 Definition of the Problems (cont)
Definition 1: SLP extraction problem
Definition 2: CLP estimation problem
Problem 2 must be solved before Problem 1 can be solved.

8 Definition of the Problems (cont)
Proposed approach: 3 modules
–Text analysis and PAT-tree indexing module
–CLP extraction module
–SLP extraction module

9 Definition of the Problems (cont)

10 Estimation of CLP
Most CLPs show a strong association between their overlapping component substrings.
Association norm estimation (AE) function
If AE is large, the patterns y and z occur together in most cases in the text collection (for example 關鍵詞抽取 with its overlapping substrings 鍵詞抽取 and 關鍵詞抽).
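
The AE formula itself is not preserved in this transcript. The sketch below assumes the form AE(x) = f(x) / (f(y) + f(z) - f(x)), where f is the occurrence count, y is x without its last character and z is x without its first character. That form is an assumption about what the slide showed, and the function names and toy corpus are hypothetical.

```python
def frequency(pattern: str, segments: list[str]) -> int:
    """Count occurrences of `pattern` in all segments, overlapping ones included."""
    total = 0
    for segment in segments:
        start = segment.find(pattern)
        while start != -1:
            total += 1
            start = segment.find(pattern, start + 1)
    return total


def association_estimate(x: str, segments: list[str]) -> float:
    """Assumed AE form: f(x) / (f(y) + f(z) - f(x)), with y = x[:-1] and z = x[1:]."""
    if len(x) < 2:
        raise ValueError("a lexical pattern has at least two characters")
    fx = frequency(x, segments)
    fy = frequency(x[:-1], segments)
    fz = frequency(x[1:], segments)
    denominator = fy + fz - fx
    return fx / denominator if denominator > 0 else 0.0


corpus = ["關鍵詞抽取方法", "關鍵詞抽取", "關鍵詞"]
print(association_estimate("關鍵詞抽取", corpus))  # y and z never occur without x
print(association_estimate("鍵詞抽", corpus))      # 鍵詞 also occurs without 詞抽
```

Under this assumed form the value is 1 when y and z never appear apart from x and falls toward 0 as they appear independently, which matches the slide's reading of a large AE.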

11 Estimation of CLP (cont)
AE alone is not enough to decide whether x has complete lexical boundaries (e.g., 關鍵詞).
To overcome this, two additional metrics are used: LCD (left context dependency) and RCD (right context dependency), e.g., 李登輝.
With these metrics:
–x is a CLP iff it has neither LCD nor RCD and its AE exceeds the threshold t3

12 Estimation of CLP (cont)
x has LCD if |L| < t1 or max over z ∈ L of f(zx)/f(x) > t2, where t1 and t2 are threshold values, f(·) is the occurrence count, L is the set of left-adjacent characters of x, and |L| is the number of unique left-adjacent characters of x.
x has RCD if |R| < t1 or max over y ∈ R of f(xy)/f(x) > t2, where t1 and t2 are threshold values, R is the set of right-adjacent characters of x, and |R| is the number of unique right-adjacent characters of x.
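
The context-dependency test can be sketched as follows. The sketch assumes the rule reads "dependent if the number of unique neighbours is below t1, or the most frequent neighbour accounts for more than a fraction t2 of the occurrences of x". Counting a segment boundary as one distinct neighbour is a design choice of this sketch, not something the slides state. All names, thresholds and the toy corpus are hypothetical.

```python
from collections import Counter

BOUNDARY = ""  # stands for "x touches the start or end of a segment"


def neighbour_counts(x: str, segments: list[str]) -> tuple[int, Counter, Counter]:
    """Return (occurrences of x, left-neighbour counts, right-neighbour counts)."""
    total, left, right = 0, Counter(), Counter()
    for segment in segments:
        start = segment.find(x)
        while start != -1:
            total += 1
            end = start + len(x)
            left[segment[start - 1] if start > 0 else BOUNDARY] += 1
            right[segment[end] if end < len(segment) else BOUNDARY] += 1
            start = segment.find(x, start + 1)
    return total, left, right


def is_dependent(neighbours: Counter, total: int, t1: int, t2: float) -> bool:
    """Assumed rule: too few unique neighbours, or one neighbour dominates."""
    if total == 0:
        return False
    strongest = max((n for ch, n in neighbours.items() if ch != BOUNDARY), default=0)
    return len(neighbours) < t1 or strongest / total > t2


def has_lcd(x: str, segments: list[str], t1: int = 2, t2: float = 0.8) -> bool:
    total, left, _ = neighbour_counts(x, segments)
    return is_dependent(left, total, t1, t2)


def has_rcd(x: str, segments: list[str], t1: int = 2, t2: float = 0.8) -> bool:
    total, _, right = neighbour_counts(x, segments)
    return is_dependent(right, total, t1, t2)


corpus = ["李登輝總統", "請李登輝來", "李登輝說", "和李登輝談"]
for pattern in ("李登", "登輝", "李登輝"):
    print(pattern, has_lcd(pattern, corpus), has_rcd(pattern, corpus))
```

In this toy corpus 登輝 is always preceded by 李 and 李登 is always followed by 輝, so both fragments are flagged, while 李登輝 has varied neighbours on both sides.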

13 Text Analysis and PAT-Tree Indexing
The PAT tree is the primary implementation structure and is used for both text retrieval and keyphrase extraction.
Delimiters (, “ ” .) determine segment boundaries; semi-infinite strings are then built from each segment.
For example: 個人電腦, 人腦
–個人電腦, 人電腦, 電腦, 腦, 人腦, 腦
Node information: comparison bit, number of external nodes, frequency
The PAT tree makes prefix search easy.
The inverse PAT tree (IPAT) makes postfix (suffix) search easy.
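
The segmentation step can be sketched as below: split the text at delimiters, then take every suffix of every segment as a semi-infinite string. The delimiter set is an assumption (the slide lists only a comma, quotation marks and a period), and the function names are hypothetical.

```python
import re

# Assumed delimiter set: ASCII and full-width punctuation plus whitespace.
DELIMITERS = re.compile(r'[,，.。、“”"\s]+')


def split_segments(text: str) -> list[str]:
    """Cut the text at delimiters and drop empty pieces."""
    return [piece for piece in DELIMITERS.split(text) if piece]


def semi_infinite_strings(text: str) -> list[str]:
    """Every suffix of every segment, in text order."""
    return [segment[i:] for segment in split_segments(text) for i in range(len(segment))]


print(semi_infinite_strings("個人電腦, 人腦"))
```

Because suffixes never cross a delimiter, no lexical pattern can span two segments.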

14 Text Analysis and PAT-Tree Indexing (cont)
Convert the semi-infinite strings to bit sequences.
Build the PAT tree from the bit sequences of the semi-infinite strings and the positions at which they differ.
An inverse PAT tree is also built over the reversed data streams of the database, to check the occurrences of LSs and RSs (e.g., 詞鍵關, 詞鍵, 詞鍵關展發, 詞鍵關行進).
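
A minimal sketch of the bit conversion, assuming the text is encoded as Big5. The slides do not name the encoding, but the bit patterns on slide 20 agree with Big5 for the characters shown. The function names are hypothetical, shorter strings are padded with zero bits, and bit positions are counted from 1.

```python
def to_bits(text: str, encoding: str = "big5") -> str:
    """Encode the text and return its bytes as a string of '0'/'1' characters."""
    return "".join(f"{byte:08b}" for byte in text.encode(encoding))


def comparison_bit(a: str, b: str, encoding: str = "big5"):
    """1-based position of the first bit at which two strings differ, or None."""
    bits_a, bits_b = to_bits(a, encoding), to_bits(b, encoding)
    width = max(len(bits_a), len(bits_b))
    # Pad with zeros so that a shorter string can be compared against a longer one.
    bits_a, bits_b = bits_a.ljust(width, "0"), bits_b.ljust(width, "0")
    for position, (bit_a, bit_b) in enumerate(zip(bits_a, bits_b), start=1):
        if bit_a != bit_b:
            return position
    return None


def inverse_stream(segment: str) -> str:
    """Reversed segment, as indexed by the inverse PAT tree."""
    return segment[::-1]


print(to_bits("人"))
print(comparison_bit("個人電腦", "人電腦"))
print(comparison_bit("人電腦", "人腦"))
print(comparison_bit("電腦", "腦"))
print(inverse_stream("關鍵詞"))
```

The three comparison bits computed here can be checked against the node labels in the tree figure on slide 21.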

15 Text Analysis and PAT-Tree Indexing (cont)
Why use a PAT tree (Patricia tree)?
–The number of key comparisons per lookup is low (logarithmic).
–Computing time and space are reduced.
–Search is efficient.
–The PAT tree can be used to check RCD.
–The inverse PAT tree can be used to check LCD.
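
The last two points can be illustrated without building a real PAT tree. The sketch below swaps in a sorted list of semi-infinite strings searched with `bisect`, which supports the same prefix queries but is a different and simpler structure than the Patricia tree the slides describe. All names are hypothetical.

```python
from bisect import bisect_left


def build_index(segments: list[str], reverse: bool = False) -> list[str]:
    """Sorted list of all suffixes; with reverse=True the segments are reversed first."""
    suffixes = []
    for segment in segments:
        if reverse:
            segment = segment[::-1]
        suffixes.extend(segment[i:] for i in range(len(segment)))
    return sorted(suffixes)


def with_prefix(index: list[str], prefix: str) -> list[str]:
    """All indexed suffixes that start with `prefix` (contiguous in sorted order)."""
    low = bisect_left(index, prefix)
    high = low
    while high < len(index) and index[high].startswith(prefix):
        high += 1
    return index[low:high]


def next_characters(index: list[str], prefix: str) -> set[str]:
    """Characters that directly follow `prefix` in the indexed text."""
    return {s[len(prefix)] for s in with_prefix(index, prefix) if len(s) > len(prefix)}


segments = ["個人電腦", "人腦"]
forward = build_index(segments)                 # stand-in for the PAT tree
inverse = build_index(segments, reverse=True)   # stand-in for the inverse PAT tree

print(len(with_prefix(forward, "腦")))          # frequency of 腦
print(next_characters(forward, "人"))           # right contexts of 人, used for RCD
print(next_characters(inverse, "腦"[::-1]))     # left contexts of 腦, used for LCD
```

The point of the two indexes is symmetry: right contexts come from a prefix query on the forward index, and left contexts come from the same query on the reversed text.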

16 Extraction of SLP
A CLP is not always an SLP
–Being a CLP does not by itself show that the pattern is significant in the text collection
–Many CLPs are common in daily use
Every CLP is checked against a set of lexical rules and a general-domain corpus
Rules:
–Numbers, adverbs, time-related terms
–General-domain PAT tree vs. specific-domain PAT tree
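
One way to read the general-domain versus specific-domain comparison is as a relative-frequency filter, sketched below. The slides do not give the actual rules or thresholds, so the stop list, the ratio test, the add-one smoothing and every name and number here are assumptions for illustration.

```python
def select_slps(
    domain_counts: dict[str, int],
    domain_total: int,
    general_counts: dict[str, int],
    general_total: int,
    stop_terms: set[str],
    min_ratio: float = 2.0,
) -> list[str]:
    """Keep CLPs that pass the lexical rules and are relatively more frequent
    in the specific-domain collection than in the general-domain corpus."""
    selected = []
    for term, count in domain_counts.items():
        # Lexical rules (assumed): drop pure numbers and listed adverbs/time terms.
        if term.isdigit() or term in stop_terms:
            continue
        domain_rate = count / domain_total
        # Add-one smoothing so that terms unseen in the general corpus do not divide by zero.
        general_rate = (general_counts.get(term, 0) + 1) / (general_total + 1)
        if domain_rate / general_rate >= min_ratio:
            selected.append(term)
    return selected


clp_counts = {"關鍵詞": 30, "我們": 40, "今天": 10, "2023": 5}
general = {"關鍵詞": 2, "我們": 500, "今天": 300}
print(select_slps(clp_counts, 1000, general, 10000, stop_terms={"今天"}))
```

In this toy data only 關鍵詞 survives: 我們 is as common in general text as in the domain, 今天 is a time-related term, and 2023 is a number.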

17 Evaluation
Extraction of SLP
–Three people were asked to select CLPs and keyphrases from 50 "seed sentences"
–These test data were used to measure the accuracy of SLP extraction

Phrase length   Total number of extracted keyphrases   Number of correct keyphrases extracted   Precision
2               3568                                   3311                                     92.8%
3               1130                                   661                                      58.5%
4               999                                    687                                      68.77%
5               207                                    150                                      72.46%
>=6             178                                    151                                      84.83%
Total           6082                                   4960                                     81.55%
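
The precision column is the number of correct keyphrases divided by the number extracted. The short check below recomputes it from the counts, reading the flattened table as (extracted, correct) pairs per phrase length. That reading is an assumption, supported by the fact that the rows then add up to the stated totals.

```python
# (extracted, correct) per phrase length, as read from the evaluation table.
ROWS = {
    "2": (3568, 3311),
    "3": (1130, 661),
    "4": (999, 687),
    "5": (207, 150),
    ">=6": (178, 151),
}


def precision(extracted: int, correct: int) -> float:
    """Precision in percent."""
    return 100.0 * correct / extracted


total_extracted = sum(extracted for extracted, _ in ROWS.values())
total_correct = sum(correct for _, correct in ROWS.values())

for length, (extracted, correct) in ROWS.items():
    print(f"length {length}: {precision(extracted, correct):.2f}%")
print(f"total: {total_correct}/{total_extracted} = {precision(total_extracted, total_correct):.2f}%")
```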

18 Evaluation (cont)
Speed and space requirements

Corpus       Corpus size (KB)   PAT tree size (KB)   Time to construct PAT tree (sec)   Time to extract keyphrases (sec)
C1-O(10k)    12                 77                   0.19                               0.01
C2-O(100k)   127                670                  2.82                               0.02
C3-O(1M)     1033               4687                 25.52                              1.62
C4-O(10M)    10048              44312                306.32                             28.51
C5-O(100M)   107333             439087               2381                               283

19 Conclusion
This method reduces the difficulty of keyphrase extraction in Chinese and gives better performance.

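20 Semi-infinite strings of 個人電腦,人腦 with their node numbers and bit sequences
String / node      Bits (columns marked at bit positions 1, 9, 17, 25)
個人電腦 / node 0    10101101 1101001110100100 …
人電腦 / node 2      10100100 0100100010111001 …
電腦 / node 4        10111001 0111000100000000 …
腦 / node 6          10111000 000000000000000 …
人腦 / node 9        10100100 0100100000000000 …
腦 / node 6          10111000 0000000000000000 …
Node-number ruler under the text 個人電腦,人腦 (as extracted): 0246891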

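21 [Figure: PAT tree for 個人電腦,人腦. Each internal node is labeled (comparison bit, number of external nodes, string frequency): (0,6,1), (4,6,1), (5,3,1), (24,2,1), (8,3,2). The nodes are numbered 0, 2, 4, 6, 9.]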

