WP 10 Multilingual Access Philipp Daumke, Stefan Schulz.

1 WP 10 Multilingual Access Philipp Daumke, Stefan Schulz

2 Multilingual Access - Rationale
[Chart categories: English as First Language, English as Second Language, English as a Foreign Language, No English Language Skills]
– < 70 % of the world's scientists read in English
– 80 % of the world's electronically stored information is in English
– 90 % of the articles in Medline are in English (2000)
Sources: The British Council, 2005; Fung ICH: Open access for the non-English-speaking world: overcoming the language barrier. Emerging Themes in Epidemiology, 2008

3 Non-native speakers
– Broad range of command of English
– Reading skills > writing skills
– Reduced active vocabulary
– Difficulty in formulating precise queries
[Diagram labels: English as Second Language, English as a Foreign Language]

4 Cross-language document retrieval example
– German query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
– English equivalent: Correlation of high blood pressure and lesion of the white substance

7 BootStrep WP 10 - Multilingual access
Objectives:
– To provide a multilingual search interface to the BootStrep Biolexicon / Bioontology
– We do NOT propose to deliver a multilingual extension of the BootStrep biolexicon
Query languages: French, German, English, (Italian)
Output language: English
Method: Subword-based semantic indexing
Resources:
– MorphoSaurus multilingual subword lexicon & thesaurus
– MorphoSaurus Semantic Indexer

8 Technique: Morphosemantic Indexing
Subword-based, multilingual semantic indexing for document retrieval
Subwords are atomic, conceptual or linguistic units:
– Stems: stomach, gastr, diaphys
– Prefixes: anti-, bi-, hyper-
– Suffixes: -ary, -ion, -itis
– Infixes: -o-, -s-
Equivalence classes contain synonymous subwords and their translations:
– #derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … }
– #inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog, … }
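
The equivalence classes above can be pictured as a lookup table from subword to class identifier. The following is only a minimal sketch: the members are copied from the slide (duplicates across languages collapsed), while the dictionary layout and the names `EQUIVALENCE_CLASSES`, `build_subword_index` and `class_of` are assumptions made for illustration, not the MorphoSaurus data format.

```python
# Toy excerpt of a subword thesaurus. The members come from the slide;
# the storage layout is an assumption for illustration only.
EQUIVALENCE_CLASSES = {
    "#derma": ["derm", "cutis", "skin", "haut", "kutis", "pele", "piel"],
    "#inflamm": ["inflamm", "-itic", "-itis", "-phlog", "entzuend",
                 "-itisch", "inflam", "flog"],
}


def build_subword_index(classes):
    """Invert class -> subwords into subword -> class."""
    index = {}
    for class_id, subwords in classes.items():
        for subword in subwords:
            # Affix hyphens are notation only, so drop them for lookup.
            index[subword.strip("-")] = class_id
    return index


SUBWORD_TO_CLASS = build_subword_index(EQUIVALENCE_CLASSES)


def class_of(subword):
    """Return the equivalence class of a subword, or None if unknown."""
    return SUBWORD_TO_CLASS.get(subword.lower().strip("-"))


if __name__ == "__main__":
    # English, German and Spanish subwords resolve to one shared class.
    print(class_of("skin"), class_of("haut"), class_of("piel"))
    print(class_of("-itis"), class_of("entzuend"))
```

Because every language maps into the same identifiers, the class identifier can serve as a language-independent index term.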

9 Subword Thesaurus Structure
Segmentation:
– Myo | kard | itis
– Herz | muskel | entzünd | ung
– Inflamm | ation of the heart muscle
Equivalence classes and their subwords:
– MUSCLE: muscle, myo, muskel, muscul
– INFLAMM: inflamm, -itis, inflam, entzünd
– HEART: herz, heart, card, corazon, card
Indexation:
– #muscle #heart #inflamm
– #heart #muscle #inflamm
– #inflamm #heart #muscle
Thesaurus: ~21,000 equivalence classes (MIDs)
Lexicon entries:
– English: ~23,000
– German: ~24,000
– Portuguese: ~15,000
– Spanish: ~11,000
– French: ~8,000
– Swedish: ~10,000
– Italian: ~4,000
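
The segmentation and indexation steps on this slide can be sketched as follows. This is an illustrative toy, not the MorphoSaurus indexer: the longest-prefix-first matcher with backtracking, the stop-word list, the entry `kard` (the slide's thesaurus excerpt lists only `card`), and the treatment of `ung` and `ation` as affixes without a class are all assumptions.

```python
# Toy lexicon: subword -> equivalence class. None marks subwords that are
# assumed to carry no semantic content (e.g. derivational suffixes).
LEXICON = {
    "myo": "#muscle", "muskel": "#muscle", "muscle": "#muscle",
    "muscul": "#muscle",
    "herz": "#heart", "heart": "#heart", "card": "#heart",
    "kard": "#heart",  # assumed entry, needed for "Myokarditis"
    "corazon": "#heart",
    "inflamm": "#inflamm", "inflam": "#inflamm", "itis": "#inflamm",
    "entzünd": "#inflamm",
    "ung": None, "ation": None,
}
STOP_WORDS = {"of", "the"}  # assumed stop-word list


def segment(word):
    """Split `word` into lexicon subwords, or return None if impossible.

    Tries the longest matching prefix first and backtracks when the
    remainder cannot be segmented.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in LEXICON:
            rest = segment(word[end:])
            if rest is not None:
                return [prefix] + rest
    return None


def index_text(text):
    """Map a text to its sequence of equivalence-class identifiers."""
    ids = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        for subword in segment(token) or []:
            class_id = LEXICON[subword]
            if class_id is not None:
                ids.append(class_id)
    return ids


if __name__ == "__main__":
    for text in ("Myokarditis", "Herzmuskelentzündung",
                 "Inflammation of the heart muscle"):
        print(text, "->", " ".join(index_text(text)))
```

With this toy lexicon the three expressions yield the same three identifiers in different orders, which is the point of the slide: the index is shared across languages and word forms.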

10 Indexing Pipeline

14 Subword-based document transformation Morphosemantic indexer

15 Subword-Based Search
– Query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
– Subword index: #correl #hyper #tens #lesion #whit #matter

16 Subword-based query transformation
– Query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
– Subword index: #correl #hyper #tens #lesion #whit #matter
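
Once queries and documents are both expressed as equivalence-class identifiers, retrieval reduces to comparing identifier lists. The sketch below ranks documents by cosine similarity over identifier counts. The slides do not state which ranking function the search engine uses, so the cosine measure, the document identifiers `doc_a` and `doc_b`, and their contents are assumptions for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two lists of class identifiers."""
    counts_a, counts_b = Counter(a), Counter(b)
    dot = sum(counts_a[key] * counts_b[key] for key in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_ids, documents):
    """Return ids of documents with non-zero similarity, best first."""
    scored = [(cosine(query_ids, ids), doc_id)
              for doc_id, ids in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Identifiers of the German query, as shown on the slide.
QUERY_IDS = ["#correl", "#hyper", "#tens", "#lesion", "#whit", "#matter"]

# Hypothetical pre-indexed English documents.
DOCUMENTS = {
    "doc_a": ["#correl", "#hyper", "#tens", "#lesion", "#whit", "#matter"],
    "doc_b": ["#heart", "#muscle", "#inflamm"],
}

if __name__ == "__main__":
    print(rank(QUERY_IDS, DOCUMENTS))
```

The German query never has to be translated into English words; it only has to land on the same identifiers as the English document.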

17 Adapting Morphosemantic Indexing for BootStrep
– BootStrep terminology is mostly disjoint from existing clinical terminology
– Enhancement of data resources (e.g. for acronym resolution, multi-term equivalences)
– BootStrep terms for multilingual access: Gene Ontology, InterPro, IntAct, Gene Regulation Ontology, Species
– Medline subcorpus (about E. coli gene regulation)

18 Ongoing/Completed Tasks
Manual training of MorphoSaurus lexica by means of the BootStrep corpora (en, de, fr)
Multilingual Terminology Browser:
– 2268 GO terms + translations
– 6925 InterPro terms + translations
– 2082 IntAct terms + translations
– URL: http://www.medinf.uni-freiburg.de/demo/BootStrepBrowser/
Multilingual Search Engine:
– Document collection: BootStrep-Medline subset
– Languages: English, German, French
– Query modes: Author, Title, Title + keywords, All

19 Terminology Browser
[Screenshot panels: Search Results, Further Information, Navigation]

20 Terminology Browser

21 Multilingual Search Engine

22 To do: Tools and Resources
BootStrep-Browser:
– Integration of Species
– Integration of the Gene Regulation Ontology
Multilingual Search Engine:
– Multilingual treatment of acronyms
– Inclusion of species synonym list
– Dealing with mixed queries (German-English, English-French)
– Integration with the fact store
Continue lexicon population:
– Italian terms?

23 To do: Evaluation
Creation of a gold standard:
– Typical English queries
– Find all relevant documents in the E. coli subset
CLIR experiments:
– Translate queries to French and German
– Compare mean average precision
Reuse of already existing routines on standard benchmarks (OHSUMED, ImageCLEF)
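
Mean average precision, the measure named for the CLIR experiments, can be computed as in the sketch below, which follows the standard textbook definition (precision at each rank where a relevant document appears, averaged over all relevant documents, then averaged over queries). The function names and the `percent_of_baseline` helper are illustrative assumptions and do not reproduce the project's evaluation routines.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list.

    `ranked` is a list of document ids, best first; `relevant` is the
    set of ids judged relevant in the gold standard.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def percent_of_baseline(score, baseline):
    """Express a cross-language score relative to a monolingual baseline."""
    return 100.0 * score / baseline


if __name__ == "__main__":
    # Invented toy data: two queries with their gold-standard judgments.
    runs = [
        (["d1", "d2", "d3"], {"d1", "d3"}),
        (["d2", "d1"], {"d1"}),
    ]
    print(mean_average_precision(runs))
```

Running the same gold-standard queries in English, French and German and comparing the resulting scores against the monolingual English run gives the kind of percent-of-baseline figure used in cross-language benchmarks.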

24 ImageCLEFMed Benchmark
Baseline: monolingual
– Stemmed English queries
– Stemmed English texts
Query translation
– Google translator
– Multilingual dictionary compiled from UMLS
Morphosemantic Indexing
– Interlingual representation of user queries and documents
Morphosemantic Indexing
– incorporating disambiguation module
[Chart: Top 20 Average Precision as Percent of Baseline for English, German, Portuguese, Spanish, French, Swedish, and Average]

