
1 Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007
Performing Cross-Language Retrieval with Wikipedia
Participation report for Ad Hoc bilingual Hungarian → English
Joint work with András A. Benczúr, István Bíró, Károly Csalogány
Data Mining and Web Search Group, Computer and Automation Research Institute, Hungarian Academy of Sciences
Péter Schönhofen

2 Our Approach
Term-by-term query translation using dictionaries
A bigram language model helps select the most probable English translation
Wikipedia is used to discard off-topic terms
IR system: Hungarian Academy of Sciences Search Engine (http://search.sztaki.hu)
  • TF×IDF-based
  • OR query, heavily weighted by the number of matched terms
  • Also takes proximity and term location into account
Only the query title is used; the description and narrative contribute to mapping the title to Wikipedia concepts
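
The retrieval behaviour described on this slide can be illustrated with a minimal sketch. It is not the code of the search engine at search.sztaki.hu: the function name `rank_documents`, the IDF formula, and the squared matched-term factor are assumptions chosen only to show an OR query whose TF×IDF score is heavily weighted by the number of matched terms. Proximity and term location are left out.

```python
import math
from collections import Counter


def rank_documents(query_terms, documents):
    """Rank documents for an OR query (hypothetical sketch, not the real engine).

    A document qualifies if it contains at least one query term. Its TF x IDF
    score is multiplied by the squared number of matched query terms; the
    square is an assumed stand-in for "heavily weighted".
    """
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        matched = [t for t in set(query_terms) if tf[t] > 0]
        if not matched:
            continue  # OR query: at least one term must match
        tfidf = sum(tf[t] * math.log(1 + n_docs / df[t]) for t in matched)
        scored.append((tfidf * len(matched) ** 2, idx))
    # Highest score first; ties broken by document index.
    return [idx for _, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

With this weighting, a document that matches three query terms once each can outrank a document that repeats a single matched term many times.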

3 Outline of the algorithm
Preparations
  • construct a dictionary
  • generate a concept network from Wikipedia
  • pre-process queries and documents
Raw translation
  • disambiguation with a bigram model
Improve translation quality with Wikipedia
  • map terms to concept space
  • rank concepts
  • map concepts to words

4 Dictionary Construction
Two sources of Hungarian-English term pairs:
  • on-line dictionary of the Institute (official + community-edited entries)
  • cross-language links present in Wikipedia
Conflicting entries are resolved in the above order of priority (official, community, Wikipedia)
100,510 dictionary entries in total (however, a large portion are idioms)
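
The priority rule for conflicting entries could be sketched as follows. The function name `build_dictionary`, the tuple layout, and the reading of "conflict" as "the same Hungarian term appears in several sources" are assumptions made for illustration; the slide does not specify these details.

```python
# Priority of the three sources named on the slide; a lower number wins.
SOURCE_PRIORITY = {"official": 0, "community": 1, "wikipedia": 2}


def build_dictionary(entries):
    """Merge (hungarian, english, source) triples into one dictionary.

    Assumption: when a Hungarian term occurs in several sources, only the
    translations from the highest-priority source are kept.
    """
    best_rank = {}
    translations = {}
    for hungarian, english, source in entries:
        rank = SOURCE_PRIORITY[source]
        if hungarian not in best_rank or rank < best_rank[hungarian]:
            # First sighting, or a better source: replace what we had.
            best_rank[hungarian] = rank
            translations[hungarian] = [english]
        elif rank == best_rank[hungarian] and english not in translations[hungarian]:
            # Same source level: keep it as an additional candidate.
            translations[hungarian].append(english)
    return translations
```

Several translations from the same source level are kept side by side, since choosing among them is left to the later disambiguation step.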

5 Raw translation
Find Hungarian dictionary terms in queries
  • Hungarian terms may overlap
Select the best translations based on the bigram model
  • a translation is better if it joins to other translations through bigrams with higher probability
  • the Wikipedia model was used, but any other large corpus would suffice
[Diagram: query → Hungarian words → translation candidates 1 and 2 for each word → score by bigram model → max]
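
A minimal sketch of bigram-based selection, under stated assumptions: each Hungarian term has single-word candidates, only adjacent translations are joined by a bigram, probabilities use add-one smoothing, and all combinations are tried exhaustively (feasible for short query titles). The function names and the toy corpus in the usage note are hypothetical; the slides do not give the exact scoring.

```python
import math
from collections import Counter
from itertools import product


def train_bigram_model(sentences):
    """Count unigrams and adjacent-word bigrams in a corpus of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_log_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev); the smoothing is an assumption."""
    vocab = max(len(unigrams), 1)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))


def select_translations(candidates, unigrams, bigrams):
    """Pick one candidate per source term so the joined bigrams score highest."""
    best, best_score = None, -math.inf
    for combo in product(*candidates):
        score = sum(
            bigram_log_prob(a, b, unigrams, bigrams)
            for a, b in zip(combo, combo[1:])
        )
        if score > best_score:
            best, best_score = list(combo), score
    return best
```

For example, with a toy corpus containing "cancer research" twice, the candidates `[["crab", "cancer"], ["research", "search"]]` resolve to `["cancer", "research"]`, because that pair is the only one backed by an observed bigram.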

6 Role of Wikipedia

7 Concept network
Regular Wikipedia articles represent concepts
  • article title is the concept name
  • links to other articles describe semantic relations
  • redirections are handled as additional concept names (a sort of synonym)
Category assignments are ignored
Wikipedia is in effect converted to an ontology
  • less formal than a proper ontology (e.g. WordNet)
  • only one type of relationship exists
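
One way to picture the resulting structure is the sketch below. The input format (a dict of article titles to linked titles, plus a dict of redirects) and the function name `build_concept_network` are assumptions, as is treating links as symmetric; the slide only says that links express a single, untyped relation.

```python
from collections import defaultdict


def build_concept_network(articles, redirects):
    """Build a name index and a relation graph from Wikipedia-like data.

    articles:  {article title: [titles it links to]}
    redirects: {redirect title: target article title}
    Returns (names, related): names maps a lower-cased name to the set of
    concepts it may denote; related maps a concept to its linked concepts.
    """
    names = defaultdict(set)
    related = defaultdict(set)
    for title, links in articles.items():
        names[title.lower()].add(title)  # the article title names the concept
        for target in links:
            target = redirects.get(target, target)  # follow a redirect once
            if target in articles and target != title:
                # Single untyped relation, stored in both directions (assumed).
                related[title].add(target)
                related[target].add(title)
    for alias, target in redirects.items():
        if target in articles:
            names[alias.lower()].add(target)  # redirect = extra concept name
    return dict(names), dict(related)
```

Because `names` maps each name to a set, one title or redirect can denote several concepts, which is what makes a later disambiguation step necessary.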

8 Map terms to concepts
Match Wikipedia article titles with query terms
Concepts behind Wikipedia article titles:
  • the same title may represent multiple concepts
  • another layer of disambiguation is introduced
Concepts are recognized through terms, and are carried by the text locations occupied by the term
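
The matching step might look like the following sketch, which assumes a hypothetical `names` index from lower-cased article titles (and redirects) to sets of concepts, and treats a text location as a token position. The function name and the `max_len` limit are illustrative choices, not details from the slides.

```python
def map_terms_to_concepts(tokens, names, max_len=4):
    """Return {token position: set of concepts carried at that position}.

    Every word n-gram of up to max_len tokens is looked up in `names`.
    Overlapping matches are all kept, so one location can carry several
    competing concepts.
    """
    carried = {}
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            term = " ".join(tokens[start:end]).lower()
            for concept in names.get(term, ()):
                # The concept is carried by every location the term occupies.
                for pos in range(start, end):
                    carried.setdefault(pos, set()).add(concept)
    return carried
```

For the tokens `["cancer", "research", "and", "treatment"]`, a `names` index containing both "cancer" and "cancer research" places several concepts on position 0, which is the ambiguity the ranking step has to resolve.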

9 Rank concepts
Select the concepts that are most tightly connected to other candidate concepts
The score of concept C is computed from three factors:
  • L: number of text locations carrying concepts semantically related to C
  • M: number of concepts carried by the same text locations as C
  • F: number of text locations carrying C
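
The three factors can be computed as in the sketch below. The slide does not state how L, M and F are combined, so the formula `(1 + L) * F / (1 + M)` is purely an assumption: it rewards well-connected, frequent concepts and penalises concepts that compete with many others for the same locations. The input layout and function names are also hypothetical, and M is taken here to count concepts other than C itself.

```python
def concept_factors(carried, related):
    """Compute (L, M, F) for every candidate concept.

    carried: {text location: set of concepts carried there}
    related: {concept: set of semantically related concepts}
    """
    concepts = set().union(*carried.values())
    factors = {}
    for c in concepts:
        own = [pos for pos, cs in carried.items() if c in cs]
        rivals = set()
        for pos in own:
            rivals |= carried[pos] - {c}
        neighbours = related.get(c, set())
        # L: locations carrying at least one concept related to c.
        l_factor = sum(1 for cs in carried.values() if cs & neighbours)
        # M: other concepts sharing a location with c; F: locations of c.
        factors[c] = (l_factor, len(rivals), len(own))
    return factors


def rank_concepts(carried, related):
    """Order concepts by an ASSUMED combination of the three factors."""
    factors = concept_factors(carried, related)
    score = {c: (1 + l) * f / (1 + m) for c, (l, m, f) in factors.items()}
    return sorted(score, key=lambda c: (-score[c], c))
```

Under this assumed formula, a concept with no related concepts elsewhere in the text (L = 0) sinks to the bottom of the ranking, which matches the stated goal of discarding off-topic terms.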

10 Map concepts to words
1. Concepts → titles (word sequences)
   pasting titles together would yield overly long queries
2. Titles → set of words
3. Words are ranked based on the scores of the concepts behind them
   the same word may represent many concepts
4. Query title words are required
   if all translations of a title word have been discarded, its translation is forcibly injected into the translated query
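
The four steps above can be sketched as follows. Several details are assumptions, not taken from the slides: a word inherits the maximum score of the concepts whose titles contain it, the query is cut to `max_words` words, and when every translation of a title word was discarded the first candidate is re-injected. The function name `concepts_to_query` is hypothetical.

```python
def concepts_to_query(concept_scores, title_translations, max_words=5):
    """Turn scored concepts into a bag-of-words English query.

    concept_scores:     {concept title: score}
    title_translations: one list of English candidates per query title word
    """
    # Steps 1-2: concept titles are split into a set of words.
    word_scores = {}
    for concept, score in concept_scores.items():
        for word in concept.lower().split():
            # Step 3: a word may stand behind many concepts; keep the best
            # score (taking the maximum is an assumption).
            if word not in word_scores or score > word_scores[word]:
                word_scores[word] = score
    ranked = sorted(word_scores, key=lambda w: (-word_scores[w], w))
    query = ranked[:max_words]
    # Step 4: every title word must be represented in the final query.
    for translations in title_translations:
        if translations and not any(t in query for t in translations):
            query.append(translations[0])
    return query
```

Splitting titles into words before ranking keeps the query short, since words shared by several concept titles appear only once.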

11 Why use Wikipedia?
Advantages
  • freely available (snapshots are downloadable)
  • relatively high quality
  • wide range of subjects covered
  • rapidly growing, up to date
Disadvantages
  • articles do not always link to other relevant articles
  • category assignments are not always consistent
  • basic verbs and nouns are not covered

12 Example query
Original query title: “cancer research”
Raw translation: “oncology”
Improved translation: “oncology cancer treatment”

13 Evaluation

14 Difficulties
Hungarian stemmer is not perfect
  • the language is complex
  • pronouns are not always recognized as such
Dictionary is small
In short: raw translation is of very low quality
Retrieval is not performed on the concept level
Context is not large enough to support the reliable selection of relevant Wikipedia concepts

15 Future work
Performing German queries against English corpora
A richer dictionary
Improved mechanism
  • raw translation is used for retrieval
  • the Wikipedia concept network is used for determining the relevance of documents in hit lists: query-document matching is carried out in the space of Wikipedia concepts
Improved matching
  • POS information is also taken into account

16 Thank you for your attention

