1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.

2 Wikify! Linking Documents to Encyclopedic Knowledge (R. Mihalcea and A. Csomai); Learning to Link with Wikipedia (D. Milne and I. H. Witten)

3 What is Wikification
- Automatic keyword extraction
- Word sense disambiguation
- Automatic cross-referencing of documents (unstructured text) with Wikipedia

4 Wikify! - Introduction
- Introduces annotation of documents by linking them to Wikipedia.
- Possible applications: the semantic web, educational applications, and a number of text-processing problems.
- Previous similar work: Microsoft Smart Tags and Google AutoLink, which are based merely on word or phrase lookup (no keyword extraction or disambiguation).

5 Wikify! - Text Wikification

6 Wikify! - Keyword Extraction
- Recommendations from the Wikipedia style manual: link terms that provide a deeper understanding of the topic, avoid linking unrelated terms, and select a proper number of keywords.
- Unsupervised algorithms involve two steps:
  – Candidate extraction: extract all possible n-grams.
  – Keyword ranking: assign a numeric value to each candidate.
- Three ranking methods were used: tf-idf, χ² (chi-squared), and keyphraseness.
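
The two steps (candidate extraction, then ranking) can be sketched for the keyphraseness method. This is a minimal Python sketch, assuming keyphraseness is estimated as the fraction of documents containing a term in which that term was linked as a keyword. The function names and the toy corpus are hypothetical and are not taken from the Wikify! paper.

```python
from collections import Counter

def extract_ngrams(tokens, max_n=3):
    """Candidate extraction: every n-gram of up to max_n tokens."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def keyphraseness(corpus, max_n=3):
    """Estimate keyphraseness from (tokens, linked_terms) pairs.

    Score = (# documents where the term is linked) /
            (# documents where the term appears at all).
    """
    appears = Counter()
    linked = Counter()
    for tokens, linked_terms in corpus:
        seen = set(extract_ngrams(tokens, max_n))
        appears.update(seen)
        linked.update(seen & set(linked_terms))
    return {term: linked[term] / appears[term] for term in appears}

def rank_candidates(tokens, scores, max_n=3):
    """Keyword ranking: sort a document's candidates by score, best first."""
    candidates = set(extract_ngrams(tokens, max_n))
    return sorted(candidates, key=lambda t: (-scores.get(t, 0.0), t))

# Toy training corpus (invented): token list plus the terms that were linked.
corpus = [
    ("the jaguar is a big cat".split(), {"jaguar", "big cat"}),
    ("the jaguar car is fast".split(), {"jaguar"}),
    ("a big cat sat".split(), set()),
]
scores = keyphraseness(corpus)
top = rank_candidates("the big cat and the jaguar".split(), scores)[:2]
print(top)
```

In this toy corpus "jaguar" is linked in every document that mentions it, while "big cat" is linked in only one of two, so "jaguar" ranks first.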

7 Wikify! - Evaluation of Keyword Extraction

8 Wikify! - Word Sense Disambiguation
- Ambiguity is inherent to human language.
- Disambiguation algorithms:
  – Knowledge-based: rely exclusively on knowledge derived from dictionaries.
  – Data-driven: based on probabilities collected from sense-annotated data.
- Wikify! uses a voting scheme that seeks agreement between the two.
- Wikify! provides highly precise annotations, even though recall is lower.
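
The agreement-based voting idea can be illustrated as follows. This is a minimal Python sketch, assuming "voting" means a sense is emitted only when both methods propose the same one and the term is left unannotated otherwise. The two disambiguator functions are hypothetical stand-ins, not the algorithms used in Wikify!.

```python
def vote(term, context, knowledge_based, data_driven):
    """Return a sense only when both methods agree; otherwise abstain (None)."""
    first = knowledge_based(term, context)
    second = data_driven(term, context)
    return first if first is not None and first == second else None

# Hypothetical stand-ins for the two families of disambiguators.
def knowledge_based(term, context):
    return "Bar (law)" if "court" in context else "Bar (establishment)"

def data_driven(term, context):
    return "Bar (law)" if "lawyer" in context else "Bar (establishment)"

agreed = vote("bar", "the lawyer addressed the court", knowledge_based, data_driven)
abstained = vote("bar", "the court adjourned", knowledge_based, data_driven)
print(agreed, abstained)
```

Abstaining on disagreement is what trades recall for precision: fewer terms are annotated, but the ones that are have two independent votes behind them.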

9 Wikify! - Disambiguation Evaluation
Word sense disambiguation results: total number of attempted (A) and correct (C) word senses, together with precision (P), recall (R), and F-measure (F).
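
How P, R, and F follow from the attempted and correct counts can be shown with a short calculation. This is a minimal Python sketch, assuming precision is C/A, recall is C divided by the total number of ambiguous instances, and F is their harmonic mean. The `total` parameter and the numbers used are hypothetical; they are not results from the paper.

```python
def precision_recall_f(attempted, correct, total):
    """Compute P, R and F from attempted (A), correct (C) and total instances."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure

# Invented counts, for illustration only.
p, r, f = precision_recall_f(attempted=80, correct=60, total=100)
print(p, r, f)
```

A system that abstains often has A well below the total, which is how precision can stay high while recall drops.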

10 Wikify! - Overall Evaluation and Conclusion
- Wikify! lets the user upload a text file or supply the URL of a web page, processes the document, and returns the wikified version.
- The user can also set the keyword density in the range 2%-10%; the default is 6%.
- In a human evaluation (20 users evaluating 10 documents each), the human-annotated version was correctly told apart from the automatic one in only 57% of cases (50% would be the ideal case, i.e. no better than chance).

11 Learning to Link with Wikipedia
- A machine learning approach to identifying significant terms within unstructured text.
- It can provide structured knowledge about any unstructured text.
- Uses Wikipedia articles as training data, which improves recall and precision.

12 Snapshot of a wikified document

13 Learning to Disambiguate Links
- Uses disambiguation to inform detection.
- Commonness and relatedness are the features used to resolve ambiguity.
- The commonness of a sense is defined by the number of times it is used as a link destination in Wikipedia articles.
- Link probability of a term = (No. of times the term is used as a link) / (No. of times the term appears in Wikipedia articles)
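
The two counts on this slide can be sketched in code. This is a minimal Python sketch, assuming commonness is read as the share of a term's links that point to a given destination, and that the "used as a link / appears" ratio is computed per term (it is named `link_probability` here). All counts are invented for illustration.

```python
from collections import Counter

def commonness(destination_counts, sense):
    """Share of a term's links whose destination is the given sense."""
    total = sum(destination_counts.values())
    return destination_counts[sense] / total if total else 0.0

def link_probability(times_linked, times_appearing):
    """Ratio of a term's uses as a link to its total appearances."""
    return times_linked / times_appearing if times_appearing else 0.0

# Invented destination counts for the anchor term "tree".
tree_links = Counter({
    "Tree": 90,
    "Tree (data structure)": 8,
    "Tree (graph theory)": 2,
})
c = commonness(tree_links, "Tree")
lp = link_probability(times_linked=100, times_appearing=2000)
print(c, lp)
```

Commonness describes a sense (where do links for this term usually go), whereas the link ratio describes the term itself (how often is it linked at all).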

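14 Disambiguation (Continued)
Relatedness is given by the following formula:
\[ \mathrm{relatedness}(a, b) = \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))} \]
where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia.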
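
The link-based measure can be computed directly from sets of inbound links. This is a minimal Python sketch, assuming the measure is (log max(|A|, |B|) - log |A ∩ B|) / (log |W| - log min(|A|, |B|)), as published by Milne and Witten. The function name, the handling of empty overlaps, and the toy article IDs are assumptions made for illustration.

```python
import math

def relatedness(links_a, links_b, total_articles):
    """Link-based measure over the sets of articles linking to a and to b.

    As written, the value is 0 when both link sets are identical and grows
    as the overlap shrinks; with no overlap it is treated as infinite.
    """
    if not links_a or not links_b:
        raise ValueError("both articles need at least one inbound link")
    common = len(links_a & links_b)
    if common == 0:
        return math.inf
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    return (math.log(larger) - math.log(common)) / (
        math.log(total_articles) - math.log(smaller)
    )

# Invented article IDs: A links to a, B links to b, W has 1000 articles.
A = {1, 2, 3, 4}
B = {3, 4, 5}
value = relatedness(A, B, total_articles=1000)
print(value)
```

Because the raw expression decreases as articles share more inbound links, a caller that wants "higher means more related" would need to invert it; that step is not shown on the slide.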

15 Disambiguation (Continued): Commonness and Relatedness

16 Disambiguation (Continued)
- Not all context terms are equally useful, so each context term is assigned a weight: the average of its link probability and its relatedness.
- Context quality is then defined as the sum of the weights assigned to the context terms.
- These features (commonness, relatedness, and context quality) are used to train the classifier.
- The classifier is configured with a parameter specifying the minimum probability of a sense.
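
The weighting of context terms and the context-quality feature can be sketched as follows. This is a minimal Python sketch, assuming each context term comes with a link probability and a relatedness score already scaled so that higher means more related. The names and the numbers are hypothetical.

```python
def context_weight(link_probability, relatedness):
    """Weight of one context term: average of its two scores."""
    return (link_probability + relatedness) / 2

def context_quality(context_terms):
    """Context quality: sum of the weights of all context terms."""
    return sum(context_weight(lp, rel) for lp, rel in context_terms.values())

# Invented (link probability, relatedness) pairs for three context terms.
context = {
    "forest": (0.10, 0.50),
    "leaf": (0.06, 0.40),
    "the": (0.00, 0.02),
}
weights = {term: context_weight(lp, rel) for term, (lp, rel) in context.items()}
quality = context_quality(context)
print(weights, quality)
```

A stop word such as "the" ends up with a weight near zero, so it contributes almost nothing to the context quality.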

17 Disambiguation Evaluation
- The disambiguation classifier was trained on 500 articles (instead of the entire Wikipedia) on a modest desktop with a 3 GHz dual-core processor and 4 GB of RAM.
- The classifier was configured using 100 Wikipedia articles.
- Training took 13 minutes and testing 4 minutes; another 3 minutes were needed to load the required summaries of Wikipedia's link structure and anchor statistics into memory.
- To evaluate the classifier, 11,000 anchors were gathered from 100 random articles.

18 Disambiguation Evaluation

19 Learning to Detect Links
- Central difference between Wikify!'s link detection and the new link detector: Wikify! relies exclusively on link probability, whereas the new approach also takes the context surrounding the terms into account.
- The link detector discards only terms with very low link probability, so that nonsense phrases and stop words are removed.

20 Features used for Link Detection
- Link probability: the average link probability is considered.
- Relatedness: semantic relatedness, taken as the average relatedness between each topic and all other candidates.
- Disambiguation confidence
- Generality
- Location and spread
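
The slide does not spell out how location and spread are measured. This is a minimal Python sketch of one plausible reading, assuming they are the first occurrence, the last occurrence, and the distance between them, each normalized by document length. The function name and the positions used are hypothetical.

```python
def location_and_spread(token_positions, doc_length):
    """Normalized first/last occurrence of a topic and the span between them."""
    if not token_positions or doc_length <= 0:
        raise ValueError("need at least one position and a positive length")
    first = min(token_positions) / doc_length
    last = max(token_positions) / doc_length
    return {
        "first_occurrence": first,
        "last_occurrence": last,
        "spread": last - first,
    }

# Invented token offsets of one topic in a 1000-token document.
features = location_and_spread([10, 250, 900], doc_length=1000)
print(features)
```

Under this reading, a topic mentioned early and again near the end gets a large spread, which suggests it runs through the whole document rather than appearing in passing.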

21 Link Detector

22 Link Detector Performance
- The same dataset as for the disambiguation classifier was used for training, configuration, and evaluation.
- The minimum link probability was set to 6.5%, the point at which recall and precision balance.
- The link detector was trained on unambiguous terms.
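
The role of the 6.5% figure can be illustrated with a simple filter. This is a minimal Python sketch, assuming 6.5% is the minimum link probability below which candidate terms are discarded before the classifier sees them. The function name and the candidate probabilities are invented.

```python
MIN_LINK_PROBABILITY = 0.065  # the 6.5% value quoted on the slide

def filter_candidates(link_probabilities, threshold=MIN_LINK_PROBABILITY):
    """Keep only the terms whose link probability reaches the threshold."""
    return {
        term: probability
        for term, probability in link_probabilities.items()
        if probability >= threshold
    }

# Invented link probabilities for four candidate terms.
candidates = {
    "wikipedia": 0.42,
    "machine learning": 0.18,
    "the": 0.0001,
    "of the": 0.0,
}
kept = filter_candidates(candidates)
print(sorted(kept))
```

Raising the threshold discards more candidates (favouring precision), while lowering it lets more through (favouring recall).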

23 Link Detector Performance (Continued)

24 Wikification in the Wild
The system was tested on news articles instead of Wikipedia articles and achieved 76.4% accuracy in link detection.

25 Conclusions
- The system resolves ambiguity as well as polysemy.
- A common hurdle in all such applications: they must somehow move from unstructured text to a collection of relevant Wikipedia articles.
- The paper contributes a proven method for extracting key concepts from plain text.
- Ultimately, these are attempts to explain and organize the sum total of human knowledge.

26 Application on itself

27 Questions?

