
1 Interconnecting lexicographic resources. In search for a model Dan Cristea “Alexandru Ioan Cuza” University of Iași Institute of Computer Science of the Romanian Academy dcristea@info.uaic.ro

2 Topics
– Why would one want to connect linguistic resources?
– Parameterising the needs
– Standardisation helps interconnecting
– A bunch of well-known resources
– How would this work?
– Final remarks
COST-ENeL, Bled, 29-30 September 2014

3 Why would one want to connect linguistic resources? Use case 1: 100 Romanian dictionaries aligned
– CLRE, the Essential Romanian Lexicographical Corpus: 100 dictionaries aligned at entry and, partially, sense levels (2010–2013, at the Institute A. Philippide of the Romanian Academy, in Iași)
– dictionaries' list at: http://85.122.23.90/resurse/Lista-dictionarelor.doc
– written in 3 types of alphabets: Cyrillic, transition and Latin
– large diversity of formatting styles

4 CLRE: Essential Romanian Lexicographical Corpus

5 The CLRE project: ISER

6 The CLRE project: Petri

7 The CLRE project: DN II

8 The CLRE project: Bălăşescu

9 The CLRE project: Lexicon Militar

10 The CLRE project: Dicționar de informatică

11 Processing in CLRE
– Scanning
– OCR – ABBYY FineReader 9
– Parsing entries => XML
– Manual verification
– Indexing and alignment

12 CLRE – manual verification

13 Why would one want to connect linguistic resources? Use case 2: align WN with an explanatory dictionary
– a WN synset: pos (def, ex, $w_1^{s_1} \dots w_k^{s_k} \dots w_n^{s_n}$)
– an explanatory dictionary entry: $w_k$, pos, … …

14 Why would one want to connect linguistic resources? Use case 2: align WN with an explanatory dictionary
– synsets of a word $w_k$, pos: (def$_1$, ex$_1$, … $w_k^{s_1}$ …) … (def$_k$, ex$_k$, … $w_k^{s_k}$ …) … (def$_m$, ex$_m$, … $w_k^{s_m}$ …)
– the explanatory dictionary entry of the word $w_k$, pos: $w_k$, pos, … …

15 Why would one want to connect linguistic resources? Use case 2: align WN with an explanatory dictionary
– a WN synset: pos (def, ex, $w_1^{s_1} \dots w_k^{s_k} \dots w_n^{s_n}$)
– explanatory dictionary entries: $w_1$, pos, … … $w_k$, pos, … … $w_n$, pos, … …

16 Why would one want to connect linguistic resources? Use case 3: the tip-of-the-tongue (TOT) problem, or the forgotten word (Michael Zock)

17 The first thought: standardisation. Lexical Markup Framework (LMF)
– What is it? A common model for the creation and use of lexical resources
– With what goal? To manage the exchange of data between and among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources

18 Near-standard: Text Encoding Initiative (TEI)
– What is it? An inventory of the features most often deployed for computer-based text processing, and recommendations about suitable ways of representing these features
– With what goal? To facilitate processing by computer programs, and to facilitate the loss-free interchange of data amongst individuals and research groups using different programs, computer systems, or application software

19 Standardisation: Text Encoding Initiative (TEI)
– Example of a dictionary entry serialisation (from the TEI Guidelines) [XML markup not preserved; element contents: disproof / dIs"pru:f / n / facts that disprove something. / the act of disproving.]
– The printed entry: disproof (dIs"pru:f) n. 1. facts that disprove something. 2. the act of disproving. (CED)

20 LMF and TEI: content and… discontent
The TEI format may be used as an interchange format, permitting sharing of resources even when their local encoding schemes differ. Both LMF and TEI model lexical material at a deep level of representational detail…

21 LMF and TEI: content and… discontent
TEI intention:
– guidance for individual or local practice in text creation and data capture
– support of data interchange
– support of application-independent local processing
This opens good possibilities for querying. But how would the interconnection work?

22 Parameterising the needs
– To connect two resources, it should be enough to simply merge their contents
– It should then be possible to interrogate the merged resource by taking advantage of the peculiarities of each resource

23 Parameterising the needs
– Ability to represent variations in word forms, alternate orthography, diachronic morphology
– Easy navigation by applying various filtering criteria

24 Parameterising the needs
– Very often lexicographic data is hierarchical: for instance, a sense of a dictionary entry contains a definition and examples, but also sub-senses
– Support for recursive searches: for instance, give me the definition neighbourhood of depth 2 of the word captain (take all senses of the entry captain and form the list of words in the corresponding definitions, then for each of them take all their senses and collect again the words in their definitions)
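
The recursive search on the word captain can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dictionary `TOY_DICT` and its definitions are invented, and the helper names `definition_words` and `neighbourhood` are hypothetical. No lemmatisation or stop-word filtering is attempted.

```python
import re

# Toy dictionary, invented for illustration: lemma -> definitions of its senses.
TOY_DICT = {
    "captain": ["the officer in command of a ship", "the leader of a team"],
    "officer": ["a person holding a position of command"],
    "ship": ["a large vessel that travels on water"],
    "leader": ["a person who leads a group"],
    "team": ["a group of players"],
}

def definition_words(lemma, dictionary):
    """Words used in the definitions of all senses of `lemma`."""
    words = set()
    for definition in dictionary.get(lemma, []):
        words.update(re.findall(r"[a-z]+", definition.lower()))
    return words

def neighbourhood(lemma, dictionary, depth):
    """Collect the definition words reachable from `lemma` in `depth` steps."""
    frontier, seen = {lemma}, set()
    for _ in range(depth):
        next_frontier = set()
        for word in frontier:
            next_frontier |= definition_words(word, dictionary)
        seen |= next_frontier
        frontier = next_frontier
    return seen

depth_1 = neighbourhood("captain", TOY_DICT, 1)
depth_2 = neighbourhood("captain", TOY_DICT, 2)
```

With this toy data, `depth_1` holds the words of the two captain definitions, while `depth_2` also reaches words such as vessel and person through ship, officer and leader.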

25 The idea: representing lexical information as feature structures centred on the lemmas of words
– printed entry: disproof (dIs"pru:f) n. 1. facts that disprove something. 2. the act of disproving. (CED)
– feature structure: [lemma=disproof, entry=[pron=dIs"pru:f, pos=n, sense=[n=1, def=facts that disprove something], sense=[n=2, def=the act of disproving], res=CED]]
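
A minimal sketch of how a TEI-style entry could be turned into such a lemma-centred feature structure. The XML string is an assumption written in the style of the dictionary chapter of the TEI Guidelines (the transcript does not preserve the original markup), and `tei_to_fs` is a hypothetical helper. Since a Python dict cannot repeat the key `sense`, the repeated feature is held as a list.

```python
import xml.etree.ElementTree as ET

# Assumed TEI-style markup for the "disproof" entry (not taken from the slides).
TEI_ENTRY = """
<entry>
  <form><orth>disproof</orth><pron>dIs"pru:f</pron></form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n="1"><def>facts that disprove something</def></sense>
  <sense n="2"><def>the act of disproving</def></sense>
</entry>
"""

def tei_to_fs(xml_text, resource):
    """Turn one TEI-style <entry> into a lemma-centred feature structure."""
    entry = ET.fromstring(xml_text)
    senses = [
        {"n": int(s.get("n")), "def": s.findtext("def")}
        for s in entry.findall("sense")
    ]
    return {
        "lemma": entry.findtext("form/orth"),
        "entry": {
            "pron": entry.findtext("form/pron"),
            "pos": entry.findtext("gramGrp/pos"),
            "sense": senses,   # repeated feature kept as a list
            "res": resource,   # the dictionary the entry comes from
        },
    }

fs = tei_to_fs(TEI_ENTRY, "CED")
```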

26 Representing lexical entries as feature structures (CED: Cambridge English Dictionary)
[lemma=disproof, entry=[pron=dIs"pru:f, pos=n, sense=[n=1, def=facts that disprove something], sense=[n=2, def=the act of disproving], res=CED]]

27 Representing lexical entries as feature structures: graph representation
[Figure: the CED entry drawn as a graph: lemma → disproof; entry → pron → dIs"pru:f, pos → n, sense → (n → 1, def → facts that disprove smth), sense → (n → 2, def → the act of disproving), res → CED]

28 Representing lexical entries as feature structures (MWCD: Merriam-Webster's Collegiate Dictionary)
[lemma=disproof, entry=[pron=dIs"pru:f, pos=n, sense=[n=1, def=the action of disproving], sense=[n=2, def=evidence that disproves], res=MWCD]]

29 Representing lexical entries as feature structures: graph representation
[Figure: the MWCD entry drawn as a graph: lemma → disproof; entry → pron → dIs"pru:f, pos → n, sense → (n → 1, def → the action of disproving), sense → (n → 2, def → evidence that disproves), res → MWCD]

30 How could lexical entries be merged? Entries of the same word from different dictionaries
[Figure: the CED graph and the MWCD graph for disproof, side by side]

31 Merging lexical entries: distinct parts
[Figure: the same two graphs; the parts that differ are the sense subtrees and the res values (CED, MWCD)]

32 Merging lexical entries
[Figure: a single graph for disproof: lemma, pron and pos appear once; a new node X groups the senses and res of CED and those of MWCD]

33 Representation of the merged feature structure
[lemma=disproof, entry=[pron=dIs"pru:f, pos=n, X=[[sense=[n=1, def=facts that disprove something], sense=[n=2, def=the act of disproving], res=CED], [sense=[n=1, def=the action of disproving], sense=[n=2, def=evidence that disproves], res=MWCD]]]]
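
The merge of the CED and MWCD entries can be sketched as follows. This is one possible reading of the slides, not a definitive algorithm: features with identical values are kept once, and the remaining, resource-specific features are grouped under the new node X. The function name `merge_entries` and the list encoding of repeated `sense` features are assumptions.

```python
def merge_entries(fs_a, fs_b):
    """Merge two feature structures describing the same lemma."""
    if fs_a["lemma"] != fs_b["lemma"]:
        raise ValueError("entries describe different lemmas")
    entry_a, entry_b = fs_a["entry"], fs_b["entry"]
    # Features carrying the same value in both resources are shared.
    shared = {k: v for k, v in entry_a.items() if entry_b.get(k) == v}
    # Everything else stays attached to its own resource, under X.
    distinct_a = {k: v for k, v in entry_a.items() if k not in shared}
    distinct_b = {k: v for k, v in entry_b.items() if k not in shared}
    return {
        "lemma": fs_a["lemma"],
        "entry": {**shared, "X": [distinct_a, distinct_b]},
    }

ced = {"lemma": "disproof", "entry": {
    "pron": 'dIs"pru:f', "pos": "n", "res": "CED",
    "sense": [{"n": 1, "def": "facts that disprove something"},
              {"n": 2, "def": "the act of disproving"}]}}
mwcd = {"lemma": "disproof", "entry": {
    "pron": 'dIs"pru:f', "pos": "n", "res": "MWCD",
    "sense": [{"n": 1, "def": "the action of disproving"},
              {"n": 2, "def": "evidence that disproves"}]}}

merged = merge_entries(ced, mwcd)
```

Here `pron` and `pos` end up directly under `entry`, while `sense` and `res` are kept per resource inside `X`.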

34 The WordNet search for disproof

35 Feature structures representation for the WN synsets of disproof
[lemma=disproof, synsets=[pos=n, synset=[lex=(*, falsification, refutation), gloss=any evidence that helps to establish the falsity of something], synset=[lex=(falsification, falsifying, *, refutation, refutal), gloss=the act of determining that something is false]]]

36 The WordNet search for discount

37 Representing WordNet synsets
[lemma=discount, synsets=[pos=n, synset=[lex=(*, price reduction, deduction), gloss=the act of reducing the selling price of merchandise], synset=[lex=(discount rate, *, bank discount), gloss=interest on an annual basis deducted in advance on a loan], …], synsets=[pos=v, synset=[lex=(dismiss, disregard, brush aside, brush off, *, push aside, ignore), gloss=bar from attention or consideration, ex="She dismissed his advances"], …]]

38 How could dictionary entries be merged with WN synsets?
[Figure: the WN synsets graph and the CED entry graph for disproof, side by side, each with its own lemma node]

39 Merging a dictionary entry with a WN entry
[Figure: the same two graphs, both rooted in lemma disproof]

40 Merging a dictionary entry with a WN entry
[Figure: one graph with a single lemma node disproof, carrying both the entry subtree (CED) and the synsets subtree (WN)]
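
A small sketch of this last step, under the assumption that merging a dictionary entry with a WN entry simply hangs both subtrees under the one shared lemma. The helper names `expand_star` and `attach_synsets` are hypothetical; the `*` placeholder in a synset is read as standing for the lemma itself.

```python
def expand_star(lex, lemma):
    """Replace the '*' placeholder in a synset with the lemma itself."""
    return [lemma if word == "*" else word for word in lex]

def attach_synsets(dict_fs, wn_fs):
    """Put a dictionary entry and the WN synsets under one shared lemma."""
    if dict_fs["lemma"] != wn_fs["lemma"]:
        raise ValueError("structures describe different lemmas")
    lemma = dict_fs["lemma"]
    synsets = {
        "pos": wn_fs["synsets"]["pos"],
        "synset": [
            {"lex": expand_star(s["lex"], lemma), "gloss": s["gloss"]}
            for s in wn_fs["synsets"]["synset"]
        ],
    }
    return {"lemma": lemma, "entry": dict_fs["entry"], "synsets": synsets}

ced = {"lemma": "disproof", "entry": {
    "pron": 'dIs"pru:f', "pos": "n", "res": "CED",
    "sense": [{"n": 1, "def": "facts that disprove something"},
              {"n": 2, "def": "the act of disproving"}]}}
wn = {"lemma": "disproof", "synsets": {"pos": "n", "synset": [
    {"lex": ["*", "falsification", "refutation"],
     "gloss": "any evidence that helps to establish the falsity of something"},
    {"lex": ["falsification", "falsifying", "*", "refutation", "refutal"],
     "gloss": "the act of determining that something is false"}]}}

merged = attach_synsets(ced, wn)
```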

41 Going one step further
– Feature structures are hierarchical data
– Codd: hierarchical data can be represented as relational tables
– Codd, E.F. (June 1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387.

42 Representing feature structures as relational tables
[Figure: the graph of a generic entry (lemma w; entry with pron, pos, sense, res; sense with n, def, cit; cit with orth, auth, yr) next to a table illustration from http://en.wikipedia.org/wiki/Relational_database]

43 Representing feature structures as relational tables
– the lemma node gives a table with columns: id, lemma, entry

44 Representing feature structures as relational tables
– the entry node gives a table with columns: entry, pron, pos, sense, res

45 Representing feature structures as relational tables
– the sense node gives a table with columns: sense, n, def, cit

46 Representing feature structures as relational tables
– the cit node gives a table with columns: orth, auth, yr
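
The mapping from a feature structure to the four tables can be sketched as below. The table names WORD, ENTRY, SENSE and CIT follow the query slides; the key column `cit` added to the CIT table, the shared id counter, the toy citation and the function name `flatten` are assumptions made for illustration. As on the slides, each parent row carries the key of its child, so a multi-valued feature yields several rows.

```python
from itertools import count

def flatten(fs, tables, ids):
    """Append the rows for one lemma-centred feature structure to `tables`."""
    entry = fs["entry"]
    entry_id = next(ids)
    tables["WORD"].append(
        {"id": next(ids), "lemma": fs["lemma"], "entry": entry_id})
    for sense in entry["sense"]:
        sense_id = next(ids)
        # One ENTRY row per sense, since the row carries the sense key.
        tables["ENTRY"].append({
            "entry": entry_id, "pron": entry["pron"], "pos": entry["pos"],
            "sense": sense_id, "res": entry["res"],
        })
        # A sense without citations still gets one SENSE row, with cit=None.
        for cit in sense.get("cit") or [None]:
            cit_id = next(ids) if cit is not None else None
            tables["SENSE"].append({
                "sense": sense_id, "n": sense["n"],
                "def": sense["def"], "cit": cit_id,
            })
            if cit is not None:
                tables["CIT"].append({"cit": cit_id, **cit})

tables = {"WORD": [], "ENTRY": [], "SENSE": [], "CIT": []}
fs = {
    "lemma": "disproof",
    "entry": {
        "pron": 'dIs"pru:f', "pos": "n", "res": "CED",
        "sense": [
            {"n": 1, "def": "facts that disprove something",
             # Invented citation, only to populate the CIT table.
             "cit": [{"orth": "toy citation", "auth": "toy author", "yr": 1800}]},
            {"n": 2, "def": "the act of disproving"},
        ],
    },
}
flatten(fs, tables, count(1))
```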

47 Relational operators
– Projection: $\pi_{a_1,\dots,a_n}(R)$ => a relation containing only the values of attributes $a_1,\dots,a_n$ from the relation $R$
– Selection: $\sigma_{\varphi}(R)$, where $\varphi$ is a logical condition => only the tuples satisfying $\varphi$ are retained from the relation (or the set) $R$
– Join: $R \bowtie S$ => the set of all combinations of tuples in $R$ and $S$ that are equal on their common attributes
– Union: $R \cup S$ => a relation representing the union of the two relations
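
The four operators can be mimicked over plain Python lists of dicts, one dict per tuple. This is a didactic sketch rather than how a database engine works; the function names `project`, `select`, `natural_join` and `union` and the toy relations are assumptions.

```python
def project(relation, *attrs):
    """Projection: keep only the listed attributes, dropping duplicates."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def select(relation, condition):
    """Selection: keep only the tuples for which `condition` holds."""
    return [row for row in relation if condition(row)]

def natural_join(r, s):
    """Join: combine tuples of r and s that agree on their common attributes."""
    out = []
    for a in r:
        for b in s:
            common = a.keys() & b.keys()
            if all(a[k] == b[k] for k in common):
                out.append({**a, **b})
    return out

def union(r, s):
    """Union: all tuples of r and s, without duplicates."""
    out = []
    for row in list(r) + list(s):
        if row not in out:
            out.append(row)
    return out

# Toy relations, invented for illustration.
WORD = [{"id": 1, "lemma": "disproof"}, {"id": 2, "lemma": "discount"}]
POS = [{"id": 1, "pos": "n"}]

joined = natural_join(WORD, POS)
lemmas = project(WORD, "lemma")
nouns = select(joined, lambda row: row["pos"] == "n")
```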

48 Interrogating a dictionary
Citations before 1850 of the entry "symphony":
$\pi_{orth}(\sigma_{lemma=\text{``symphony''} \land yr<1850}(WORD \bowtie ENTRY \bowtie SENSE \bowtie CIT))$
[Figure: the graph of a generic entry (lemma w; entry with pron, pos, sense, res; sense with n, def, cit; cit with orth, yr, auth)]
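
The same interrogation can be tried out with SQLite from Python, where `NATURAL JOIN` plays the role of the join operator, `WHERE` the selection and the `SELECT` list the projection. The column layout follows the tables of the previous slides, with an added key column `cit` in CIT as an assumption; all rows are invented toy data, and `citations_before` is a hypothetical helper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE WORD  (id INTEGER, lemma TEXT, entry INTEGER);
    CREATE TABLE ENTRY (entry INTEGER, pron TEXT, pos TEXT, sense INTEGER, res TEXT);
    CREATE TABLE SENSE (sense INTEGER, n INTEGER, def TEXT, cit INTEGER);
    CREATE TABLE CIT   (cit INTEGER, orth TEXT, auth TEXT, yr INTEGER);
""")
# Toy rows, invented for illustration only.
con.executemany("INSERT INTO WORD VALUES (?, ?, ?)",
                [(1, "symphony", 10), (2, "captain", 20)])
con.executemany("INSERT INTO ENTRY VALUES (?, ?, ?, ?, ?)",
                [(10, None, "n", 100, "toy"), (20, None, "n", 200, "toy")])
con.executemany("INSERT INTO SENSE VALUES (?, ?, ?, ?)",
                [(100, 1, "toy definition", 1000),
                 (100, 1, "toy definition", 1001),
                 (100, 1, "toy definition", 1002),
                 (200, 1, "toy definition", 2000)])
con.executemany("INSERT INTO CIT VALUES (?, ?, ?, ?)",
                [(1000, "toy citation A", "author A", 1790),
                 (1001, "toy citation B", "author B", 1848),
                 (1002, "toy citation C", "author C", 1901),
                 (2000, "toy citation D", "author D", 1800)])

def citations_before(con, lemma, year):
    """Citation texts (orth) of `lemma` dated before `year`, oldest first."""
    rows = con.execute(
        """SELECT orth
           FROM WORD NATURAL JOIN ENTRY NATURAL JOIN SENSE NATURAL JOIN CIT
           WHERE lemma = ? AND yr < ?
           ORDER BY yr""",
        (lemma, year),
    )
    return [orth for (orth,) in rows]

result = citations_before(con, "symphony", 1850)
```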

49 Interrogating a combination between a dictionary and a wordnet
All synonyms of nouns belonging to citations dated before 1850, sorted lexicographically.

50 Interrogating a combination between a dictionary and a wordnet
The citations of the title word $w$:
$\pi_{orth}(\sigma_{lemma=w \land yr<1850}(WORD \bowtie ENTRY \bowtie SENSE \bowtie CIT))$

51 Interrogating a combination between a dictionary and a wordnet
Unify and lemmatise the words belonging to the citations:
$lem(U(\pi_{orth}(\sigma_{lemma=w \land yr<1850}(WORD \bowtie ENTRY \bowtie SENSE \bowtie CIT))))$

52 Interrogating a combination between a dictionary and a wordnet
All synonyms of nouns belonging to citations dated before 1850, sorted lexicographically:
$\pi_{lex}(\sigma_{pos=n \land lemma \in lem(U(\pi_{orth}(\sigma_{lemma=w \land yr<1850}(WORD \bowtie ENTRY \bowtie SENSE \bowtie CIT))))}(WORD \bowtie SYNS \bowtie SYN))$
[Figure: the generic entry graph (lemma w; entry, sense, cit with orth, yr, auth) together with the synsets graph (pos n; synset with lex and gloss)]
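
The combined interrogation can be traced step by step on toy data. Everything below is a sketch under stated assumptions: the citations, the lemma table `LEMMAS` and the miniature wordnet `WORDNET` are invented, `lem` is a lookup-table stand-in for a real lemmatiser, and `U` simply unions the word forms found in the citation texts.

```python
import re

# Toy data, invented for illustration only.
CITATIONS = [  # (title word, citation text, year)
    ("symphony", "the captain heard a symphony", 1820),
    ("symphony", "ships carried the orchestra", 1900),
]
LEMMAS = {"heard": "hear", "ships": "ship", "carried": "carry"}
WORDNET = [  # (lemma, pos, synonyms listed in the synset)
    ("captain", "n", ["skipper", "master"]),
    ("symphony", "n", ["symphonic music"]),
    ("ship", "n", ["vessel"]),
    ("hear", "v", ["listen"]),
]

def U(texts):
    """Union of the word forms occurring in a collection of citation texts."""
    words = set()
    for text in texts:
        words |= set(re.findall(r"[a-z]+", text.lower()))
    return words

def lem(words):
    """Toy lemmatiser: look the form up, fall back to the form itself."""
    return {LEMMAS.get(word, word) for word in words}

def synonyms_of_cited_nouns(w, before):
    # Selection and projection on the dictionary side: citation texts.
    orth = [text for lemma, text, yr in CITATIONS if lemma == w and yr < before]
    cited = lem(U(orth))
    # Selection and projection on the wordnet side: synonyms of cited nouns.
    lex = {syn
           for lemma, pos, syns in WORDNET
           if pos == "n" and lemma in cited
           for syn in syns}
    return sorted(lex)

result = synonyms_of_cited_nouns("symphony", 1850)
```

With this toy data the verb hear and the noun ship (cited only in 1900) contribute nothing, so only synonyms of captain and symphony are returned.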

53 Conclusions
– I did not propose any model, I simply made some observations (nothing is really new)
– Linking lexicographic resources:
  – one resource => TEI representation => feature structures => hierarchical graphs => relational tables
  – more resources => unification of tables
  – use query and relational operators for interrogation

54 Discussion
– Only a sketch: a lot of details should still be filled in
– The good news: XML structures (the native language of TEI) accept direct representations as database records: XSLT => opening direct access to a complex querying language: XQuery => mimicking the relational operators and adding more facilities

55 Discussion
More good news: representing variable depth structures
– recursive hierarchies (Kamfonas): "Fixed depth dimensions are simpler to implement, maintain and query… Hierarchies that have variable depth or an uncertain number of levels… can often benefit if implemented as recursive hierarchies." http://www.kamfonas.com/id3.html
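
One common way to store and query such a variable depth hierarchy is a parent-pointer table walked with a recursive query. The sketch below uses SQLite's `WITH RECURSIVE`; the table `node`, its columns, the toy sense tree and the helper `descendants` are assumptions for illustration and are not taken from the Kamfonas text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, label TEXT)")
# Toy sense tree of uneven depth, invented for illustration.
con.executemany("INSERT INTO node VALUES (?, ?, ?)", [
    (1, None, "entry"),
    (2, 1, "sense 1"),
    (3, 2, "sense 1.a"),
    (4, 3, "sense 1.a.i"),
    (5, 1, "sense 2"),
])

def descendants(con, root_id):
    """All nodes below `root_id`, with their depth relative to it."""
    rows = con.execute(
        """WITH RECURSIVE sub(id, label, depth) AS (
               SELECT id, label, 0 FROM node WHERE id = ?
               UNION ALL
               SELECT node.id, node.label, sub.depth + 1
               FROM node JOIN sub ON node.parent = sub.id
           )
           SELECT label, depth FROM sub WHERE depth > 0
           ORDER BY depth, label""",
        (root_id,),
    )
    return rows.fetchall()

below_sense_1 = descendants(con, 2)
below_entry = descendants(con, 1)
```

The same query works whatever the number of sub-sense levels, which is the point of the recursive representation.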

56 Discussion
Even more good news (hopes)
– interrogations can be formulated in natural language => an interpreter translates them into the query language of a DBMS
– as such, a handy tool for the benefit of lexicographers

57 Acknowledgements
– Work partially supported by the project The Computational Representative Corpus of Contemporary Romanian Language, a project of the Romanian Academy, and partially by the COST-ENeL project
– I thank Isabelle Tamba and Mădălin Pătrașcu for the slides describing the CLRE project

58 Thank you!

