Shyrokov Volodymyr, Bugakov Oleg Krygin Maxim, Sydorchuk Nadiia Ukrainian Lingua-Information Fund NASU Ukrainian National Linguistic Corpus and its application.


1 Shyrokov Volodymyr, Bugakov Oleg, Krygin Maxim, Sydorchuk Nadiia. Ukrainian Lingua-Information Fund, NASU. Ukrainian National Linguistic Corpus and its application

2 The main results of the theoretical studies and an overview of the practical implementations obtained at ULIF-NASU are presented in the collective monograph “Corpus Linguistics”: Корпусна лінгвістика / Широков В.А., Бугаков О.В., Грязнухіна Т.О., Костишин О.М., Кригін М.Ю., Любченко Т.П., Рабулець О.Г., Сидоренко О.О., Сидорчук Н.М., Шевченко І.В., Шипнівська О.О., Якименко К.М. – К. – Довіра, – 471 с.

3 UNLC statistics
General Corpus:
- 4868 storage objects;
- 1013 MB of text for indexing;
- more than 62 million tokens.
Legislation Corpus:
- 5757 storage objects;
- 151 MB of text for indexing;
- more than 18 million tokens.

4 Technological principles for creating UNLC
The information architecture and functionality of UNLC are designed and organized according to the systems engineering of virtual lexicographic laboratories. In accordance with this concept, UNLC is built using Service-Oriented Architecture (SOA) and Web service technology, with the Internet serving as the communication infrastructure.
The following technology standards are used:
- XML for data description;
- SOAP for exchanging structured messages in distributed systems;
- WSDL for service description;
- UDDI for storing WSDL descriptions and providing them on request.
Windows Communication Foundation (WCF) is used for interaction between the different levels of UNLC. It is a service-oriented system for data and message exchange that lets software components interact locally or remotely through a simplified, unified programming model for cross-platform interaction. A necessary condition for the software suite to function is robust support for security and data integrity.

5 The general scheme of the linguistic corpus
[Diagram blocks: L_C, E_LIB, E_LING, MC_B, Index, G_O, B_D, MDI]
E_LIB: bibliographic subsystem (electronic library);
E_LING: linguistic subsystem;
MDI: subsystem for constructing the multidimensional index;
Index: multidimensional index base, the database of the results of MDI work;
MC_B: microcontext base. This item is virtual and is generated dynamically on the user’s query; it returns the set of microcontexts that match the user’s search query.

6 Bibliographic subsystem
The bibliographic subsystem is a multipurpose information system that accumulates information of different kinds and serves as a tool to collect, store, model and use natural language information in digital form. The objects stored in the bibliographic system may be electronic objects in any data format, so the library can hold manuscripts, audio, video and other multimedia information besides ordinary printed texts.
Functions of the bibliographic subsystem:
- forming a brief bibliographic description, according to the rules of bibliographic description, from the metadata elements of the storage object recorded in the database;
- forming a detailed bibliographic description of the storage object;
- editing the metadata set of a bibliographic description in accordance with the changes made by a bibliographer;
- analysis of changes in the bibliographic record;
- work with the file system objects;
- editing, inserting and deleting profiles, specifications, vocabularies and their elements.

7 Search by the bibliographic parameters
- The user independently selects a search field from those included in the search profile. If it is a text field, the user enters a value; if the field has a limited set of values, the user selects the search value from a dictionary.
- For the advanced search, combinations of the logical operators “and” and “or” are used.
- Search results are presented as a list of bibliographic descriptions.
- The user can view the complete list of bibliographic parameters for each object, view the resource (the full text) and save the search results to a file.
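The advanced search described above can be illustrated with a minimal sketch. It is not the UNLC implementation: the record fields (`author`, `genre`), the helper names and the sample data are all hypothetical, and the only idea taken from the slide is that field conditions are combined with “and” and “or”.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]               # one bibliographic description (hypothetical shape)
Predicate = Callable[[Record], bool]  # a search condition over a record


def field_contains(field: str, text: str) -> Predicate:
    """Free-text field: match if the entered text occurs in the field value."""
    return lambda rec: text.lower() in rec.get(field, "").lower()


def field_equals(field: str, value: str) -> Predicate:
    """Dictionary field: match only one of the allowed values exactly."""
    return lambda rec: rec.get(field) == value


def all_of(*preds: Predicate) -> Predicate:
    """Logical 'and' of several conditions."""
    return lambda rec: all(p(rec) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical 'or' of several conditions."""
    return lambda rec: any(p(rec) for p in preds)


def search(records: List[Record], condition: Predicate) -> List[Record]:
    """Return the bibliographic descriptions that satisfy the condition."""
    return [rec for rec in records if condition(rec)]


# Hypothetical sample data, for illustration only.
SAMPLE = [
    {"title": "Constitution of Ukraine", "author": "", "genre": "legislation"},
    {"title": "Haidamaks", "author": "Shevchenko T. G.", "genre": "poetry"},
    {"title": "Commercial Code", "author": "", "genre": "legislation"},
]
```

For example, `search(SAMPLE, any_of(field_contains("author", "shevchenko"), all_of(field_equals("genre", "legislation"), field_contains("title", "code"))))` returns the descriptions of “Haidamaks” and “Commercial Code”.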

8

9

10 Linguistic subsystem
The linguistic subsystem provides full-text information processing and serves as a tool for retrieving contexts for users’ search queries, taking certain linguistic parameters into account.
Functions of the linguistic subsystem:
- creating the full-text index;
- clearing the full-text index;
- adding an indexing object;
- indexing objects;
- removing an indexed object from the full-text index;
- full-text search for words and phrases in all sources, or in sources selected by bibliographic description, with the ability to set the distance between the search words;
- providing statistics;
- viewing the microcontexts;
- saving the microcontexts of words and phrases to a file;
- maintenance functions.

11 Marking the structural elements
- Structuring by text divisions: “section”, “part”, “paragraph”, “title”, “conclusions”, “summary”, “abstract”.
- Marking the paragraphs.
- Marking words written in non-Ukrainian alphabets.
- Structuring the text by sentences, marking the beginning and end of each one.
- Marking the words whose grammatical codes are defined by special rules. This concerns:
  a) hyphenated words whose first part is an abbreviation made of Ukrainian and Latin uppercase letters;
  b) abbreviations;
  c) proper names unambiguously identified by the context.
- Marking non-author text (quotes, direct speech).
- Identifying text units that have no morphological status and are not interpreted by the rules of the morphological analyzer.
- Marking words or text fragments written with letter-spacing.
- Marking places in the text that need to be edited later.

12 Search by the linguistic parameters
Search by the linguistic parameters relies on the full-text index. The user enters a search phrase, sets the desired maximum number of words between the search words and selects additional full-text search options, namely:
- search in a certain subset of objects;
- use of the word order;
- use of the distance between words;
- use of lemmatization;
- use of synonymy.
The result of the full-text search is a list of bibliographic descriptions. Unlike the bibliographic search, however, the user gets direct access to every occurrence of the search item in the text, i.e. to all the contexts that contain it. After choosing a source, the user can view the contexts, in which the search item is highlighted in red. The size (length) of the context can be changed.
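The role of the “distance between words” and “word order” options can be shown with a small sketch. It scans a token list directly, whereas UNLC works through a full-text index; the function names, the tokenizer and the sample sentence are assumptions made for illustration, and lemmatization and synonymy are left out.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Very rough tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def proximity_hits(tokens: List[str], first: str, second: str,
                   max_gap: int, keep_order: bool = True) -> List[Tuple[int, int]]:
    """Find positions (i, j) of `first` and `second` separated by at most
    `max_gap` other tokens. With keep_order=True, `second` must follow `first`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        for j, other in enumerate(tokens):
            if other != second or i == j:
                continue
            if keep_order and j < i:
                continue
            if abs(j - i) - 1 <= max_gap:
                hits.append((i, j))
    return hits


def microcontext(tokens: List[str], i: int, j: int, width: int = 3) -> str:
    """Return the hit together with `width` tokens of context on each side."""
    lo = max(0, min(i, j) - width)
    hi = min(len(tokens), max(i, j) + width + 1)
    return " ".join(tokens[lo:hi])


SENTENCE = "The national corpus of the Ukrainian language is a linguistic corpus"
TOKENS = tokenize(SENTENCE)
```

Here `proximity_hits(TOKENS, "national", "corpus", max_gap=0)` finds only the adjacent pair at positions `(1, 2)`, and the `width` argument of `microcontext` plays the part of the adjustable context size mentioned on the slide.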

13

14

15 For further processing, all the contexts, or the contexts of a certain source, can be saved to an HTML file that records the source of the contexts, the time of creation and the search phrases.

16 Applying UNLC
- The source base of linguistic information for creating the fundamental academic multivolume lexicographic system “Ukrainian Language Dictionary”;
- The database for linguistic research to identify new linguistic phenomena and formalize the existing ones;
- The system for grammatical marking;
- Statistical analysis of the text data;
- The environment for accumulating and processing information objects of different nature;
- The environment of interaction with the systems of grammar, synonymic and explanatory dictionaries;
- Creation of different linguistic and information systems (LIS) by the corpus technologies: LIS “The Constitution of Ukraine”; LIS “T. G. Shevchenko Electronic Encyclopedia”;
- Linguistic expertise.

17 The explanatory “Ukrainian Language Dictionary”

18 Editing system of the dictionary entry

19 LIS “The Constitution of Ukraine”

20 T. G. Shevchenko Electronic Encyclopedia

21 LIS “Haidamaks”

22 Linguistic expertise
The principle of applying statistical methods in the linguistic expertise: text → preliminary processing → statistical portrait → parameters of comparison or analysis → analysis → result.
The program for checking students’ works for plagiarism:
- forms a linguistic corpus of abstracts;
- compares any text with the abstracts from the corpus by various criteria;
- creates and visualizes the result of the comparison.
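The pipeline above (text, statistical portrait, comparison, result) can be sketched as follows. The slides do not say which statistics or comparison criteria the program uses, so this example assumes the simplest option: the portrait is a table of relative word frequencies and the criterion is cosine similarity. All names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def portrait(text: str) -> Dict[str, float]:
    """Statistical portrait (assumed here): relative frequency of each word."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def similarity(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Cosine similarity of two portraits, between 0.0 and 1.0."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_against_corpus(text: str, corpus: Dict[str, str]) -> List[Tuple[str, float]]:
    """Compare a text with every abstract in the corpus, best match first."""
    target = portrait(text)
    scores = [(name, similarity(target, portrait(body))) for name, body in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A call such as `rank_against_corpus(submitted_text, abstracts)` returns pairs of abstract name and score, and the scores can be shown as percentages. Word-frequency similarity is only one possible criterion; comparing shared phrases would catch copied passages more directly.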

23 The window of the linguistic expertise program

24 Selecting topics for comparison

25 The result of text analysis
When the text of an abstract was compared with the texts from the corpus of abstracts by one of the criteria, two texts were found that match the abstract under examination by 63% and 53%, respectively.

26 Visualization of the program work results

27 Comparison of the texts of the 20-volume and 11-volume explanatory dictionaries

28 The analysis of the political parties’ platforms

29 The concordance statistics

30 The most frequent lexemes in the programs of parties (blocs)

31

32 Relative intensities of the key concepts in the election programs of the political parties in 2002

33 Disambiguation in the text using statistical methods
Lexical homonymy: КОСА
1. Braided hair (a braid).
2. An agricultural tool for mowing grass, grain, etc., in the form of a narrow curved blade attached to a handle (a scythe).
3. A narrow alluvial strip of land in a sea, river, etc., joined to the shore at one end (a spit).
Grammatical homonymy: ПРАВ
1. право: noun, neuter gender, genitive case, singular.
2. правити: verb, perfective aspect, imperative mood, second person, singular.
3. прати: verb, imperfective aspect, past tense, masculine gender, singular.

34 The scheme of the disambiguation algorithm
- Manual marking of the initial training text $T_0$: receiving marking $M(T_0)$
- Receiving statistics of the grammatical chains $S_0$
- Disambiguation by the statistical method in the training text $T_i$ (receiving marking)
- Control of the received marking by the specialist, corrective actions, additional marking $M(T_i)$
- Disambiguation in the text of a certain genre
- Combining statistics $S_i$ and $S_{i-1}$
- Receiving statistics $S_i$

35 $T_i = \{(w_1)\, r_1\, (w_2)\, r_2\, (w_3) \dots (w_N)\}$, where $w_i$ are word forms, $r_i$ are word form delimiters, and $N$ is the number of word forms in the text.
$M\colon T \to M(T) = \{(v_1, g_1)\, (v_2, g_2)\, (v_3, g_3) \dots (v_N, g_N)\}$, where $v_i$ defines the part of speech of the word form and $g_i$ defines its grammatical meaning.

36 $S(T) = \{([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}]),\ p([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}]),\ i = 1, 2, \dots, N\}$, where $i$ is the ordinal number of the word form in the text and $([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}])$ is a chain of grammatical meanings.
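A minimal sketch of how such chain statistics could be gathered from a marked text. The slide does not define $p$, so the sketch assumes it is the relative frequency of a chain; the tag values and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Tag = Tuple[str, str]          # (v, g): part of speech and grammatical meaning
Chain = Tuple[Tag, Tag, Tag]   # three consecutive tags


def chain_statistics(marking: List[Tag]) -> Counter:
    """Count every chain of three consecutive tags in the marking M(T)."""
    return Counter(tuple(marking[i:i + 3]) for i in range(len(marking) - 2))


def chain_frequencies(stats: Counter) -> Dict[Chain, float]:
    """Relative frequency of each chain (assumed meaning of p)."""
    total = sum(stats.values())
    return {chain: n / total for chain, n in stats.items()} if total else {}


# Hypothetical marking of a six-word text.
MARKING = [("noun", "nom"), ("verb", "pres"), ("noun", "acc"),
           ("noun", "nom"), ("verb", "pres"), ("noun", "acc")]
STATS = chain_statistics(MARKING)
```

Because the statistics are plain counters, combining $S_i$ and $S_{i-1}$ as in the scheme of the algorithm can be written as `STATS + other_stats`, which adds the counts chain by chain.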

37 Disambiguation
$M'\colon T \to M'(T)$, $M'(T) = \{(v_1, g_1)'\, (v_2, g_2)' \dots (v_N, g_N)'\}$, where $M'(T) \subset M(T)$:
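The selection rule itself is not preserved in the transcript. The sketch below therefore rests on an assumption: for a homonymous word form, it picks the reading whose chain with the two preceding readings is most frequent in the statistics, and it leaves the form unrecognized when no candidate chain has been seen. The names and tag values are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Tag = Tuple[str, str]  # (v, g): part of speech and grammatical meaning


def disambiguate(candidates: Sequence[Sequence[Tag]],
                 stats: Counter) -> List[Optional[Tag]]:
    """Choose one reading per word form.

    candidates[i] lists the possible readings of the i-th word form;
    stats maps a chain of three tags to its frequency, as in S(T).
    None marks a homonymous word form that could not be resolved.
    """
    chosen: List[Optional[Tag]] = []
    for i, options in enumerate(candidates):
        if len(options) == 1:          # unambiguous word form
            chosen.append(options[0])
            continue
        best, best_count = None, 0
        # Two resolved readings to the left are needed to form a chain.
        if i >= 2 and chosen[i - 2] is not None and chosen[i - 1] is not None:
            for tag in options:
                count = stats[(chosen[i - 2], chosen[i - 1], tag)]
                if count > best_count:
                    best, best_count = tag, count
        chosen.append(best)
    return chosen


# Hypothetical statistics and readings.
ADJ, NOUN = ("adj", "gen"), ("noun", "gen")
PAST, IMPER = ("verb", "past"), ("verb", "imper")
STATS = Counter({(ADJ, NOUN, PAST): 3, (ADJ, NOUN, IMPER): 1})
```

With these statistics, `disambiguate([[ADJ], [NOUN], [PAST, IMPER]], STATS)` resolves the third word form to `PAST`. A left-to-right greedy choice is only one option; scoring whole sentences would also use the context to the right.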

38 Software The grammatical marking program interface

39 Receiving $S(T_i)$
- the first column contains the triples $([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}])$;
- the second column contains information about punctuation within the chain;
- the third column gives the position of the chain relative to the beginning of the sentence;
- the fourth column gives the absolute frequency of the triple $([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}])$ in the text.

40 Marking the unambiguous word forms on the example of the Commercial Code

41 Disambiguation in the Commercial Code text

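42 Results of automatic disambiguation
Columns: text T; statistics S(T); number of word forms (total, homonymous, unambiguous); homonymous word forms recognized (total, wrong, right); homonymous word forms unrecognized.
- Constitution of Ukraine: word forms total …, homonymous … (72.63%), unambiguous 3868 (27.37%).
- Commercial Code, S(T) from the Constitution: word forms total …, homonymous … (67.64%), unambiguous … (32.36%); recognized 526 (1.78%), of which wrong 321 (61%) and right 205 (39%); unrecognized 34,635 (98.22%).
- Family Code, S(T) from the Constitution + Commercial Code: word forms total …, homonymous … (67.25%), unambiguous 7705 (32.75%); recognized … (71%), of which wrong 1853 (16.44%) and right 9420 (83.56%); unrecognized 4546 (28.74%).
- Penal Code, S(T) from the Constitution + Commercial Code + Family Code: word forms total …, homonymous … (68.74%), unambiguous … (31.26%); recognized … (80.24%), of which wrong 3015 (9.2%) and right … (90.8%); unrecognized 8062 (19.76%).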

43 Thank you for your attention

