Corpus Exploitation from Wikipedia for Ontology Construction Gaoying Cui, Qin Lu, Wenjie Li, Yirong Chen The Department of Computing The Hong Kong Polytechnic University

2 Outline
- Introduction
- Related Works
- Algorithm design
  - Classification Tree Traversal
  - Ranking nodes in the classification tree
- Experiments and Evaluations
- Conclusion and Future Works

3 Background
- Ontology Construction
  - Manual construction
    - Corpus is not necessary
    - Small scale
  - Automatic or semiautomatic construction
    - Domain specific corpus
    - Good domain knowledge coverage

4 Related Works
- Corpus Selection
  - Corpus by linguists
    - British National Corpus (BNC) [Collin F. Baker et al., 1998]
  - Corpus from publications
    - Reuters News Corpus [Latifur Khan, Feng Luo, 2002]
  - Corpus from the Internet
    - Search results from the Web as corpus [P Cimiano et al., 2004]

5 Use of Wikipedia as a Resource
- Statistical and analysis work
  - [A Lih, 2004], [Jakob Voss, 2005]
- Link structure and cultural bias analysis of Wiki
  - [M Völkel, M Krötzsch, D Vrandecic, H Haller and R Studer, 2006], [F Bellomi and R Bonato, 2005]
- Add semantic links
  - Add semantic links between concepts in Wiki pages
  - [M Völkel, 2006], [Michael Strube, Simone Paolo Ponzetto, 2006]
- Corpus for XML retrieval
  - [L Denoyer, P Gallinari, 2006]

6 Problems
- Manually Selected Corpus
  - Domain experts needed
  - Time and labor intensive
- Corpus Collection from Publications
  - Limitation in time and region
- Internet Exploitation
  - Difficulty in domain specific data identification

7 Wikipedia Overview
- Established in 2001
  - 500,000 articles in 2005
  - 1 million articles in Nov. 2006
  - More than 2 million articles to date
- Different types of data
- Abundance of domain specific data
  - Availability of category information
  - Too many reachable nodes

8 Algorithm Design
- Basic Idea
  - Use the classification tree to reach only certain qualified nodes among all reachable nodes
- Classification Tree Traversal
  - Given a root node: P_r (category node)
  - Breadth-First-Search algorithm
- Initialization
  - W_r = 1 for root node P_r
  - W_i = 0 if P_i is not on the current traversal path
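The traversal and initialization described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the adjacency mapping `children`, the function name, and the sample categories are assumptions, and the rule for updating weights during traversal is not given on the slide, so only the initial values are set.

```python
from collections import deque


def traverse_classification_tree(children, root):
    """Breadth-first traversal of a category graph starting at `root`.

    `children` is a hypothetical mapping {node: [sub-nodes]}.
    Returns the visiting order and the initial weights:
    W = 1 for the root, W = 0 for every other node reached.
    """
    weights = {root: 1.0}          # W_r = 1 for the root node P_r
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children.get(node, []):
            if child not in seen:  # visit each node once, even if shared
                seen.add(child)
                weights[child] = 0.0   # initial weight before any scoring
                queue.append(child)
    return order, weights


# Made-up category graph for illustration only.
graph = {
    "IT": ["Software", "Hardware"],
    "Software": ["Compilers"],
    "Hardware": ["Compilers"],
}
order, weights = traverse_classification_tree(graph, "IT")
```

Because Wikipedia categories form a graph rather than a strict tree, the `seen` set keeps a node that is reachable through several parents from being queued more than once.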

9 Tree traversal and weights
[Figure: Wiki graph and classification tree, showing the in-edges and out-edges of a node P, counted as N_in(P) and N_out(P)]

10 Ranking Schemes (1)
- S_1
  - Considers the sum of scores of P_c's out-edges pointing to the classification tree against the total number of P_c's out-edges
  - The 1 in the denominator keeps it from being 0
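The formula itself did not survive in this transcript, so the following is only one plausible reading of the S_1 description: the summed weights of the out-links that land in the classification tree, divided by the number of out-links plus 1. The function name, argument names, and sample values are assumptions for illustration.

```python
def s1_score(out_links, tree_weights):
    """Assumed reading of S_1 for a candidate page P_c.

    out_links    : pages that P_c links to (its out-edges)
    tree_weights : {node: weight} for nodes in the classification tree
    """
    # Sum the weights of out-edges whose target is in the classification tree.
    in_tree = sum(tree_weights.get(target, 0.0) for target in out_links)
    # The +1 keeps the denominator non-zero for pages without out-edges.
    return in_tree / (len(out_links) + 1)


# Made-up values: two of three out-links point into the tree.
score = s1_score(["A", "B", "C"], {"A": 1.0, "B": 0.5})
```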

11 Ranking Schemes (2)
- S_2
  - Considers the summation of P_c's in-edges in the classification tree against the total number of in-edges of the P_i's, which are P_c's upper-level nodes
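As with S_1, the original formula is missing here, so this sketch is an assumed interpretation of the S_2 description: the count of P_c's in-edges that originate inside the classification tree, divided by the total number of in-edges of its upper-level nodes P_i. All names and sample data are hypothetical, and the zero-denominator guard is an addition not mentioned on the slide.

```python
def s2_score(page, in_links, tree_nodes, parents):
    """Assumed reading of S_2 for a candidate page P_c.

    page       : the candidate page P_c
    in_links   : {node: [pages linking to that node]}
    tree_nodes : set of nodes in the classification tree
    parents    : upper-level nodes P_i of the candidate page
    """
    # In-edges of P_c whose source lies in the classification tree.
    in_tree = sum(1 for src in in_links.get(page, ()) if src in tree_nodes)
    # Total in-edges over all upper-level nodes P_i.
    parent_total = sum(len(in_links.get(p, ())) for p in parents)
    return in_tree / parent_total if parent_total else 0.0


# Made-up link data for illustration only.
links = {"Pc": ["A", "B", "X"], "A": ["R", "Y"], "B": ["R"]}
score = s2_score("Pc", links, {"R", "A", "B"}, ["A", "B"])
```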

12 Ranking Schemes (3)
- S_3
  - Considers the summation of the out-edge nodes in the classification tree divided by both P_c's out-edge scores and its upper-level nodes P_i's in-edge scores

13 Data
- Wikipedia Resource
  - English version in XML
  - 1,100,000 articles
  - Cut-off date: Nov. 30, 2006
- Domain Connected Branches
  - 549,486 nodes for IT
  - 549,433 nodes for biology

14 Evaluation on Scheme Selection
- Evaluation by sampling
  - For the top 20,000 nodes: 10 nodes in every 1,000 nodes
  - For the remaining nodes: 10 nodes in every 10,000 nodes
- Corpus size
  - Top 20,000 nodes: 98M for IT, 101M for Biology
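The sampling plan above can be sketched as follows. The slide does not say how the 10 nodes in each block were chosen, so uniform random sampling is an assumption here, as are the function name, parameter names, and the toy ranked list.

```python
import random


def sample_ranked_nodes(ranked, top_n=20000, top_block=1000,
                        rest_block=10000, per_block=10, seed=0):
    """Draw `per_block` nodes from each block of a ranked node list.

    Blocks are `top_block` wide inside the first `top_n` nodes and
    `rest_block` wide afterwards. Random choice within a block is an
    assumption; the slide only gives the block sizes and sample counts.
    """
    rng = random.Random(seed)
    groups = []
    start = 0
    while start < len(ranked):
        if start < top_n:
            end = min(start + top_block, top_n)  # do not cross the top_n boundary
        else:
            end = start + rest_block
        block = ranked[start:end]
        groups.append(rng.sample(block, min(per_block, len(block))))
        start = end
    return groups


# Toy ranked list of 45,000 node ids: 20 blocks of 1,000, then 3 larger blocks.
groups = sample_ranked_nodes(list(range(45000)))
```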

15 Sampling Results of Different Schemes
[Table 1: Evaluation Result of Different Schemes in the IT Domain (rows S_1, S_2, S_3; columns are sample groups 1 to 25)]
[Table 2: Evaluation Result of Different Schemes in the Biology Domain (rows S_1, S_2, S_3; columns are sample groups 1 to 25)]

16 Overall Precision on sampled data
Scheme   Average IT Coverage   Average Biology Coverage
S_1      19.0%                 0.0%
S_2      76.5%                 72.0%
S_3      92.5%                 86.5%

17 Root Node Identification
- Different root nodes lead to different classification structures
  - E.g. "Category: Electronics"
    - For electronics
    - For IT
- Compare to Library of American Congress Classification (LACC)
  - Widely used library classification in most research and academic areas

18 Comparisons with LACC
Comparisons of Classification Trees with Root Nodes from Respective Domains
                             IT                  Biology
Root Node                    Terms   Relations   Terms   Relations
LACC                         32      28          31      25
Wiki                         26      23          21      17
Domain Classification Tree   21      20                  15

19 Comparisons with LACC (2)
Comparisons of Classification Tree Structures with LACC with Root Node: Electronics
                             For Electronics     For IT
Root Node: Electronics       Terms   Relations   Terms   Relations
LACC                         34      42          32      28
Wiki                         30      36          26      23
Domain Classification Tree   23      30          14      2

20 Conclusion
- Acquire leaf nodes through qualified classification tree branches in Wiki
- The best-performing ranking takes both in-edges and out-edges into consideration
- Selection of proper root nodes does affect the results
  - Pick the most common term as the root node

21 Future Works
- Improve Ranking Functions
  - Using page contents
  - Using hyperlinks in contexts of pages
- Set different weight parameters for different domains

Thanks! Q & A
