Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary


1 Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
Torsten Zesch, Christof Müller, Iryna Gurevych
Computer Science Department | Ubiquitous Knowledge Processing Lab | © Prof. Dr. Iryna Gurevych | 12 May 2008

2 Knowledge sources in NLP
NLP applications: Information Extraction, Information Retrieval, Keyword Extraction, Named Entity Recognition, Question Answering, Semantic Relatedness, Text Categorization, Text Summarization, Word Sense Disambiguation.
Knowledge sources have long been used in NLP.

3 Wikipedia & Wiktionary in NLP
Wikipedia in NLP applications: Information Extraction (Ruiz-Casado et al., 2005), Information Retrieval (Gurevych et al., 2007), Named Entity Recognition (Bunescu & Pasca, 2006), Question Answering (Ahn et al., 2004), Text Categorization (Gabrilovich & Markovitch, 2006).
Wiktionary: over 1.5 million entries in 171 language editions. It is created in the same way as Wikipedia. Although it does not receive as much attention as Wikipedia, it contains a remarkable number of entries.
Wiktionary in NLP applications: Semantic Relatedness (Zesch et al., 2008); further applications: ???

4 Wisdom of Linguists vs. Wisdom of Crowds
Criteria compared for linguists and crowds: high data quality, significant size, available for many languages, low construction costs, up-to-dateness, API.

5 Wikipedia – A quickly growing resource

6 Wisdom of Linguists vs. Wisdom of Crowds
Criteria compared for linguists and crowds: high data quality, significant size, available for many languages, low construction costs, up-to-dateness, API.

7 Wikipedia – A multi-lingual resource
Chart: share of English-only vs. non-English entries in Wikipedia (~75% other languages).
Increasing amount of content in non-Top-10 languages.

8 Wisdom of Linguists vs. Wisdom of Crowds
Criteria compared for linguists and crowds: high data quality, significant size, available for many languages, low construction costs, always up-to-date, API available.

9 Outline
Motivation, Wikipedia, Wiktionary, Summary

10 Wikipedia – Disambiguation pages
Disambiguation pages provide a sense inventory, including domain-specific senses.
Application: Word Sense Disambiguation

11 Wikipedia – Redirect pages
Synonyms: Pope Benedict XVI, Joseph Ratzinger, Joseph Cardinal Ratzinger
Spelling variations: Benedict the Sixteenth, Benedict the 16th, Benedict 16th, Benedict 16, Benedict XVI, Benedict xvi
Misspellings: Josef Ratzinger (instead of Joseph)
Abbreviations: PB16
Applications: Named Entity Recognition, Co-reference Resolution
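
As a rough illustration of how redirects could support named entity recognition or co-reference resolution, the sketch below maps surface forms to one article title. `RedirectResolver` and its methods are hypothetical names, not part of JWPL; the target title and the case-insensitive matching are simplifying assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; it is not part of JWPL.
final class RedirectResolver {
    // Normalized surface form -> canonical article title.
    private final Map<String, String> canonicalByForm = new HashMap<>();

    // Registers a redirect title together with the article it points to.
    void addRedirect(String redirectTitle, String articleTitle) {
        canonicalByForm.put(normalize(redirectTitle), articleTitle);
        // The article title resolves to itself, whatever its casing.
        canonicalByForm.put(normalize(articleTitle), articleTitle);
    }

    // Returns the canonical title, or the mention unchanged if it is unknown.
    String resolve(String mention) {
        return canonicalByForm.getOrDefault(normalize(mention), mention);
    }

    // Heuristic: two mentions co-refer if they resolve to the same title.
    boolean sameEntity(String a, String b) {
        return resolve(a).equals(resolve(b));
    }

    // Simplification (assumption): trim and lowercase before lookup.
    private static String normalize(String title) {
        return title.trim().toLowerCase(Locale.ROOT);
    }

    // Toy data from the slide; the choice of target article is an assumption.
    static RedirectResolver example() {
        RedirectResolver r = new RedirectResolver();
        String target = "Pope Benedict XVI";
        String[] forms = {
            "Joseph Ratzinger", "Joseph Cardinal Ratzinger",
            "Benedict the 16th", "Josef Ratzinger", "PB16"
        };
        for (String form : forms) {
            r.addRedirect(form, target);
        }
        return r;
    }

    public static void main(String[] args) {
        RedirectResolver r = example();
        System.out.println(r.resolve("Josef Ratzinger"));
        System.out.println(r.sameEntity("PB16", "Benedict the 16th"));
    }
}
```

In a real setting the redirect pairs would come from the Wikipedia database rather than a hard-coded array.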

12 Wikipedia – Categories
Categories contain articles and are organized in a hierarchy.
Example categories: Engines, Energy conversion, Piston engines, Aircraft piston engine, Automobile engines, Piston Engine Configurations
Applications: Information Retrieval, Semantic Relatedness

13 Accessing Wikipedia
NLP applications access Wikipedia through an Application Programming Interface.
Large-scale applications need efficient access.
Crawling + bot framework

14 Challenge – Size of Wikipedia

15 Challenge – Size of Wikipedia
Illustration: how many books it would take to print the English Wikipedia.

16 Accessing Wikipedia – Related work
NLP applications access Wikipedia through an Application Programming Interface.
Large-scale applications need efficient access.
Access methods: Crawling + bot framework, XML dump, Crawling, Database server
Related work: (Riddle, 2006), (Gabrilovich, 2007), (Shanks, 2005), (Summers, 2006), (Ponzetto & Strube, 2006)

17 System Architecture
Data transformation (one-time effort): the Wikipedia dump of each language (Language 1, Language 2) is transformed into an optimized database.

18 Data transformation
Currently available optimized databases: English, German, Czech, Ukrainian
Wikipedia Time Machine tool (release in August 2008):
Users can create their own databases.
A snapshot from a certain date is reconstructed using the revision history.
Enables diachronic research.
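
One way to picture the snapshot reconstruction is to keep, for every page, the latest revision that is not newer than the requested date. The sketch below follows that idea only; `SnapshotSketch`, `Revision`, and `snapshotAt` are hypothetical names, and the real tool may work differently.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the Wikipedia Time Machine implementation.
final class SnapshotSketch {
    // Minimal stand-in for one entry of a page's revision history.
    static final class Revision {
        final String pageTitle;
        final Instant timestamp;
        final String text;

        Revision(String pageTitle, String isoTimestamp, String text) {
            this.pageTitle = pageTitle;
            this.timestamp = Instant.parse(isoTimestamp);
            this.text = text;
        }
    }

    // Returns page title -> text of the latest revision at or before the cutoff.
    // Pages whose first revision is after the cutoff are left out.
    static Map<String, String> snapshotAt(List<Revision> history, Instant cutoff) {
        Map<String, Revision> latest = new HashMap<>();
        for (Revision rev : history) {
            if (rev.timestamp.isAfter(cutoff)) {
                continue; // revision did not exist yet at the snapshot date
            }
            Revision best = latest.get(rev.pageTitle);
            if (best == null || rev.timestamp.isAfter(best.timestamp)) {
                latest.put(rev.pageTitle, rev);
            }
        }
        Map<String, String> textByPage = new HashMap<>();
        for (Revision rev : latest.values()) {
            textByPage.put(rev.pageTitle, rev.text);
        }
        return textByPage;
    }

    public static void main(String[] args) {
        List<Revision> history = List.of(
            new Revision("Engine", "2006-01-01T00:00:00Z", "first version"),
            new Revision("Engine", "2007-06-01T00:00:00Z", "second version"));
        System.out.println(snapshotAt(history, Instant.parse("2006-12-31T00:00:00Z")));
    }
}
```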

19 System Architecture
Run-time: NLP applications (WSD, Information Retrieval) use the Java-based API (JWPL), which accesses the optimized database through object-relational mapping.
One-time effort: data transformation of the Wikipedia dump of each language (Language 1, Language 2) into an optimized database.

20 JWPL – Wikipedia API
Configuration: host, database, user, password, language
Objects: Wikipedia, Page, ParsedPage, Section, Paragraph, Link, Table, Category, CategoryGraph, MetaData, WikiTimeMachine, ...
All languages are supported, given that an optimized dump is available.
Features: iterating over pages and categories, access to redirects

21 JWPL – Code example
// Connection settings for the optimized Wikipedia database (placeholders).
DatabaseConfiguration dbConfig = new DatabaseConfiguration();
dbConfig.setDatabase("DATABASE");
dbConfig.setHost("SERVER_URL");
dbConfig.setUser("USER");
dbConfig.setPassword("PASSWORD");
dbConfig.setLanguage("LANGUAGE");
// Create the Wikipedia object and its category graph.
Wikipedia wiki = new Wikipedia(dbConfig);
CategoryGraph cg = new CategoryGraph(wiki);
// Look up two categories and measure the path between them.
Category c1 = wiki.getCategory("Germany");
Category c2 = wiki.getCategory("France");
int pathLength = cg.getPathLengthInNodes(c1, c2);
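
To make the notion of a path length in a category graph concrete, here is a minimal breadth-first search over a toy, undirected category graph. It is a sketch under stated assumptions and not JWPL's implementation: `CategoryPathSketch` and its methods are hypothetical, the toy links are invented for illustration, and the sketch counts edges, which may differ from the convention behind `getPathLengthInNodes`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the JWPL CategoryGraph.
final class CategoryPathSketch {
    // Undirected adjacency: category -> directly linked categories.
    private final Map<String, Set<String>> neighbors = new HashMap<>();

    // Adds a parent/child link; direction is ignored for path search.
    void addLink(String parent, String child) {
        neighbors.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
        neighbors.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
    }

    // Shortest path length in edges, 0 for identical categories,
    // -1 if a category is unknown or no path exists.
    int pathLengthInEdges(String from, String to) {
        if (!neighbors.containsKey(from) || !neighbors.containsKey(to)) {
            return -1;
        }
        if (from.equals(to)) {
            return 0;
        }
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int d = distance.get(current);
            for (String next : neighbors.get(current)) {
                if (distance.containsKey(next)) {
                    continue; // already visited
                }
                if (next.equals(to)) {
                    return d + 1;
                }
                distance.put(next, d + 1);
                queue.add(next);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        CategoryPathSketch g = new CategoryPathSketch();
        g.addLink("Engines", "Piston engines");
        g.addLink("Piston engines", "Automobile engines");
        System.out.println(g.pathLengthInEdges("Engines", "Automobile engines"));
    }
}
```

Shorter paths are commonly read as a sign of closer semantic relatedness, which is why such a measure is useful on top of the category hierarchy.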

22 Outline
Motivation, Wikipedia, Wiktionary, Summary
Wikipedia has been used in many applications. Why use Wiktionary?

23 Wiktionary – Wikipedia‘s lexical companion
Entry information: Language, Etymology, Pronunciation, Part-of-speech, Word senses, Synonyms, Derived Terms, Translations
Further information: Abbreviations, Antonyms, Categories, Collocations, Examples, Glosses, Hypernyms, Hyponyms, Morphology, Quotations, Related terms, Troponyms

24 Wiktionary – Entry types
Acronyms
Colloquial, slang or pejorative words
Common misspellings (e.g. basicly vs. basically)
Compounds
Contractions (e.g. o’ vs. of)
Disputed usage words (e.g. irregardless vs. regardless)
Onomatopoeia (e.g. grr)
Protologisms (e.g. iPodian)
Simplified spelling variants (e.g. thru vs. through)

25 Accessing Wiktionary – Related work
NLP applications access Wiktionary through an Application Programming Interface.
Large-scale applications need efficient access.
Access methods: Crawling + bot framework, XML dump, Crawling, Database server

26 System Architecture
Data transformation (one-time effort): the Wiktionary XML dump of each language (Language 1, Language 2) is transformed into an optimized database.

27 Challenge – Information Extraction
English, German

28 MediaWiki Parser
The MediaWiki markup parser needs to be adapted to each language.
Currently available: English, German
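
The need for per-language adaptation can be sketched with a tiny extractor whose only language-specific part is the marker that opens the synonym block. Everything here is an assumption for illustration: `SynonymExtractorSketch` is a hypothetical name, it is not the JWKTL parser, and the markers `====Synonyms====` and `{{Synonyme}}` stand in for simplified English and German entry layouts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the JWKTL parser.
final class SynonymExtractorSketch {
    // Language-specific marker that opens the synonym block (assumed layouts).
    enum Lang {
        ENGLISH("====Synonyms===="),
        GERMAN("{{Synonyme}}");

        final String synonymMarker;

        Lang(String synonymMarker) {
            this.synonymMarker = synonymMarker;
        }
    }

    // Captures the target of a wiki link such as [[word]] or [[word|label]].
    private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|#]+)");

    // Collects link targets from the first synonym block of the entry text.
    static List<String> synonyms(String wikitext, Lang lang) {
        List<String> result = new ArrayList<>();
        boolean inBlock = false;
        for (String line : wikitext.split("\n")) {
            String trimmed = line.trim();
            if (!inBlock) {
                inBlock = trimmed.equals(lang.synonymMarker);
                continue;
            }
            // The block ends at a blank line, a heading, or the next template.
            if (trimmed.isEmpty() || trimmed.startsWith("=") || trimmed.startsWith("{{")) {
                break;
            }
            Matcher m = LINK.matcher(trimmed);
            while (m.find()) {
                result.add(m.group(1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String entry = "====Synonyms====\n* [[depository]]\n";
        System.out.println(synonyms(entry, Lang.ENGLISH));
    }
}
```

Supporting another language edition in this sketch would mean adding another marker, which mirrors why only English and German are listed as available.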

29 System Architecture
Run-time: NLP applications (Semantic Relatedness, WSD) use the Java-based API (JWKTL), which accesses the optimized database through object retrieval.
One-time effort: data transformation of the Wiktionary XML dump of each language (Language 1, Language 2) into an optimized database.

30 JWKTL – Wiktionary API
Configuration: database location, language
Objects and information: Wiktionary, Word, Language, PoS, Sense, Synonyms, Translations, Etymology, Pronunciation
Encyclopedias and dictionaries existed before. Each of these pieces of information may also be available in classical knowledge bases provided by linguists. But CKBs (Wikipedia/Wiktionary) have advantages. There are also challenges ...

31 JWKTL – Code example
// Open the Wiktionary database for English entries.
Wiktionary w = new Wiktionary(DB_PATH, Language.ENGLISH);
// Retrieve all entries for "bank" and take the first one.
List<WiktionaryWord> wordList = w.getWords("bank");
WiktionaryWord word = wordList.get(0);
// Part of speech, gloss at sense index 0, and hyponyms for the same index.
PartOfSpeech pos = word.getPartOfSpeech();
String gloss = word.getGloss(0);
List<String> hyponyms = word.getRelatedTerms(Relation.HYPONYMY, 0);

32 Outline
Motivation, Wikipedia, Wiktionary, Summary

33 Wisdom of Linguists vs. Wisdom of Crowds
Criteria compared for linguists and crowds: high data quality, significant size, available for many languages, low construction costs, always up-to-date, API available.

34 Summary
Wikipedia and Wiktionary are excellent knowledge sources for NLP.
Access is challenging due to size and semi-structured content.
APIs provide efficient access, enabling large-scale NLP tasks:
Easy-to-use object-oriented Java programming interface
Fine-grained access to structural elements and contained knowledge
Freely available for research purposes
Wiktionary API (JWKTL) will be available in late June.
JWPL objects: CategoryGraph, Page, Category, Wikipedia, ParsedPage, Section, Paragraph, Link, Table, ..., MetaData
What you do with it is up to you. It has already been used in many projects.

35 Ubiquitous Knowledge Processing Lab
Acknowledgments: Ubiquitous Knowledge Processing Lab
JWPL: examples and demo
JWKTL will be available soon (July 2008).

