CrossLexica: A Large Electronic Dictionary of Collocations and Semantic Links in Russian Igor A. Bolshakov National Polytechnic Institute Mexico City,

1 CrossLexica: A Large Electronic Dictionary of Collocations and Semantic Links in Russian Igor A. Bolshakov National Polytechnic Institute Mexico City, Mexico

2 Synopsis
CrossLexica, an electronic dictionary of collocations and semantic links in Russian, has been developed with special emphasis on collocations. It contains a vocabulary of 185,000 entries and a matrix of classified links between these entries. As many as 1.75 million nonempty syntagmatic links represent the same number of collocations, so CrossLexica exceeds any monolingual dictionary in volume. CrossLexica's structure restores and delivers a collocation in its true grammatical form when a query contains any of its collocates; thus the problem of two-sided data inversion is easily solved. CrossLexica generates missing collocations from the available collocates, manages the order and the recall of delivery to the screen, and (if needed) rejects unwanted stylistic elements. The main operational mode is interactive, primarily for creating and editing Russian texts. CrossLexica also has a special external link for numerous non-interactive applications that are unmanageable or poorly manageable with the available linguistic tools.

3 Exposition plan
Topic domains covered
General features and features of entries
Various types of links between entries
Collocations
The most frequent collocates
Semantic links
Semantic links support collocations
Other linguistic resources
Tags of idiomaticity and colloquialism
User's options
Interactive applications
Non-interactive applications
An example of a delivery to the screen
A professional's opinion on CrossLexica
Dreams about CrossLexica's future

4 Topic domains covered
Economy and business
Politics and political science
Engineering and technologies (electronics, computers, programming, cars, construction, etc.)
Exact, hard, and natural sciences (mathematics, physics, chemistry, biology, geology, geography, etc.)
Humanities, arts, and religion (linguistics, history, confessions, etc.)
Medicine (mainly of everyday life)
Colloquial language (a lot of purely colloquial and abusive words and expressions)

5 General features
Total vocabulary size: 185,000 entries
  Nouns: 31%
  Verbs: 21%
  Adjectives: 27%
  Adverbs: 21%
Homonymous groups included: 2,300, with 5,400 senses in total
Total number of collocations: 1.75 million
Total number of semantic links: 2 million
Paronymous links of two types: 200,000

6 Dictionary entries may be
Noun entry (noun: separate entries for singular and plural)
Verb entry (personal forms + infinitive: perfective and imperfective aspects separately)
Adjective entry (adjectives or participles: two aspects separately)
Adverb entry (adverbs or gerunds: two aspects separately)
Auxiliary words (prepositions, conjunctions) are built into collocations and usually have no entries of their own.
Predicative utterances like а пошел ты 'go to hell' are treated as adverbials.

7 Noun entry may describe
An individual noun: аберрация 'aberration', аббревиатура 'abbreviation', абзац 'paragraph', битва 'battle', бифштекс 'steak', блага 'goods',...
A stable noun group: алкогольные напитки 'alcoholic drinks', сельское хозяйство 'agriculture', точка зрения 'point of view', уровень жизни 'standard of living', болеутоляющие средства 'analgesics'...

8 Verb entry may describe
An individual verb: говорить 'to speak', идти 'to go', обсуждать 'to discuss', спать 'to sleep',...
A verb with a reflexive pronoun: вести себя 'to conduct oneself', чувствовать себя 'to feel',...
A verb group: наводить страх 'to inspire fear', оказывать внимание 'to pay attention', испытывать стремление 'to aspire'...

9 Adjective entry may describe
An individual adjective: абстрактный 'abstract', автономный 'autonomous', воздушно-реактивный 'aerojet'...
An individual participle (maybe adjectivized): агглютинирующий 'agglutinating', агонизирующий 'agonizing', вдвинутый 'moved in', возимый 'being carried by', коррумпированный 'corrupt'...
An adjective group: хорошо одетый 'well-dressed', большой дальности 'of long range', бросающийся в глаза 'conspicuous', бывший в употреблении 'second-hand', из ряда вон выходящий 'outstanding', с маслом 'dressed with oil', как бархат 'like velvet', как сталь 'like steel', без правил 'without rules', большого ума 'of great wisdom'...

10 Adverb entry may describe
An individual adverb: абсолютно 'absolutely', абстрактно 'in an abstract way', аляповато 'garishly', быстро 'quickly', долго 'for a long time', плохо 'badly', по-хорошему 'in an amicable way', удовлетворительно 'satisfactorily'...
An individual (specifically Russian) gerund: базируясь '(while) basing', надев '(after) putting on', удовлетворившись '(after) being satisfied'...
An adverb group: аккуратным образом 'in an accurate way', без воодушевления 'without enthusiasm', более или менее 'more or less', как выжатый лимон 'like a squeezed lemon', как лед 'like ice', в особой степени 'to a high degree', куда попало 'anywhere', на цыпочках 'on tiptoe', долгое время 'for a long time'...

11 Links between entries divide into
Syntagmatic links (= collocations = word combinations): думать о былом 'to think about the past', самолет садится 'the plane is landing', хорошо поесть 'to eat well', очень длинный 'very long', сотрудничество с британцами 'cooperation with Britons', предельно внимательно 'extremely attentively'…
Semantic links
  Synonyms: дурак 'fool' – болван 'blockhead'
  Semantic derivates: Москва 1 'Moscow' – москвичи 'Muscovites'...
  Part – whole: террариум 'terrarium' – зоопарк 'zoo'
  Genus – species: документ 'document' – диплом 1 'diploma'
  Antonyms: длинный 'long' – короткий 'short'
Paronymous links (similarity in letters or morphs): кадка – каска, качка...; бег – бегун, бега, пробежка...

12 Collocations
A collocation is a pair of content words (= collocates) that are syntactically linked and stably compatible in meaning.
The syntactic dependency link between the collocates can include an auxiliary word (preposition or conjunction):
  content word 1 (auxiliary word) content word 2
  сотрудничество ради мира 'cooperation for peace'
Each collocation is accessible from either of its collocates. Hence the number of unilateral links is twice the number of collocations.
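The two-sided access described on this slide can be illustrated with a small sketch. It is not CrossLexica's actual data structure; the class name `CollocationIndex` and its methods are hypothetical, and the collocates are stored as lemmas chosen for the illustration.

```python
from collections import defaultdict


class CollocationIndex:
    """Toy two-sided index: every collocation is filed under both collocates."""

    def __init__(self):
        self._by_collocate = defaultdict(set)
        self._collocations = set()

    def add(self, word1, word2, auxiliary=""):
        # One collocation = content word 1 + optional auxiliary word + content word 2.
        entry = (word1, auxiliary, word2)
        self._collocations.add(entry)
        self._by_collocate[word1].add(entry)
        self._by_collocate[word2].add(entry)

    def lookup(self, word):
        # The same collocation is returned whichever collocate is queried.
        return sorted(self._by_collocate.get(word, ()))

    def counts(self):
        # (number of collocations, number of unilateral links)
        unilateral = sum(len(entries) for entries in self._by_collocate.values())
        return len(self._collocations), unilateral


index = CollocationIndex()
index.add("сотрудничество", "мир", auxiliary="ради")
index.add("сотрудничество", "британцы", auxiliary="с")
print(index.lookup("мир"))
print(index.counts())
```

With two collocations stored, the index holds four unilateral links, which mirrors the doubling stated above.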

13 The most numerous collocation types
Modificative pair (noun & adjective, or verb / adjective / adverb & adverb): краснокочанная капуста 'red cabbage', явный наглец 'obvious impudent fellow', резко высказаться 'to state bluntly', полностью ясный 'completely clear', ужасно рад 'awfully glad'...
Verb & directly / indirectly / prepositionally complementing noun: рассмотреть вопрос 'to analyze a problem', воротить нос 'to turn up one's nose', остаться из-за погоды 'to stay because of the weather'…
Noun subject & verbal or adjectival predicate: самолет вылетел 'the plane took off', внимание привлечено 'attention was caught', доклад (был) краток 'the talk is (was) short'...
Noun & subordinated noun: сердце матери 'mother's heart', отличия в произношении 'differences in pronunciation', борьба против терроризма 'struggle against terrorism'...

14 Some other types of collocations
Adjective & directly, indirectly, or prepositionally complementing noun: красный от стыда 'red with shame', покрытый навозом 'covered with manure', согретый солнцем 'warmed by the sun', открытый для публики 'open to the public'...
Verb & complementing infinitive: собраться поехать 'to prepare to go', мечтать выкупаться 'to dream of taking a bath', хотеть перекусить 'to wish to have a snack'...
Noun & complementing infinitive: соблазн сказать 'temptation to say', желание уйти 'wish to leave', проблема выжить 'problem of surviving'...
Verb & complementing adjective: быть нормальным 'to be normal', вернуться здоровым 'to return healthy', найти мертвым 'to find dead'…
Stable coordinate pairs: автобусы и троллейбусы 'buses and trolleybuses', ясный и четкий 'clear and well-defined', экономический и культурный 'economic and cultural', быть или не быть 'to be or not to be', взвесить и решить 'to ponder and to decide', власть и бизнес 'power and business', в срок и в полном объеме 'in time and in full', наука и техника 'science and technology'...

15 The most frequent collocates (1/4)
Nouns with the maximal number of governing verbs
513 работа 1 'work'
456 деньги 'money'
411 ребенок 'child'
386 место 1 'place'
374 дом 1 'house'
366 дети 'children'
363 руки 'hands'
337 дело 1 'business'
327 книга 'book'
321 дорога 'way'
302 глаза 'eyes'
301 город 'city'

16 The most frequent collocates (2/4)
Nouns with the maximal number of modifiers
1173 человек '(hu)man'
709 лицо 1 'face'
549 работа 1 'work'
539 глаза 'eyes'
534 женщина 'woman'
527 взгляд 1 'look'
510 вид 1 'view'
506 режим 2 'mode'
494 голос 1 'voice'
433 покрытие 1 'cover'
408 препараты 'preparations'
400 анализ 'analysis/test'

17 The most frequent collocates (3/4)
Verbs with the maximal number of complements
2284 быть 'to be'
2185 иметь 'to have'
1442 находиться 'to be (located)'
1270 стать 'to become'
1095 начать 'to begin'
1068 получить 'to get'
963 считать 'to consider'
959 вести 'to conduct'
951 оказаться 'to turn out to be'
918 требовать 'to require'
916 использовать 'to use'
910 провести 'to carry out'

18 The most frequent collocates (4/4)
The most common adjective modifiers
2943 большой 'big'
2037 крупный 'large'
1739 небольшой 'small'
1592 новый 'new'
1456 постоянный 'constant'
1401 полный 1 'complete'
1309 явный 'evident'
1198 огромный 'huge'
1163 многочисленный 'numerous'
1149 сильный 'strong'

19 Semantic links
Synonyms: 19,000 synsets averaging 5.6 members; unilateral links – 1.2 million
Semantic derivates: groups like {извлечение 'extraction'; извлекать 'to extract'; извлеченный 'extracted', извлекший 'having extracted'; извлекая 'while extracting', по извлечении 'after extraction', путем извлечения 'by extraction'}; unilateral links – 0.9 million
Part (or quantifier) vs. whole: unilateral links – 25,000
Genus vs. species: unilateral links – 14,000
Antonyms: unilateral links – 12,000

20 Semantic links support collocations
Semantic links help the user comprehend the meaning of the entry keywords. Glosses are absent except for homonymous entries, but English translations are intended to be ubiquitous.
A set of collocations lacking in the matrix is generated automatically at runtime, based on synonymy and hyponymy of the available collocates:
  (bunch of flowers) & (asters IS_A flowers) ⇒ (bunch of asters)
The correctness of the collocations thus generated is not guaranteed, and this is signaled by displaying them in low contrast.
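The runtime generation step can be sketched roughly as follows. This is only an illustration under simple assumptions: collocations are plain (head, dependent) pairs, `is_a` maps a species to its genus, and the function name `generate_missing` and the "stored"/"generated" labels are hypothetical stand-ins for the low-contrast display.

```python
def generate_missing(collocations, is_a, synonyms):
    """Infer collocations by replacing the dependent with a hyponym or a synonym."""
    generated = set()
    for head, dependent in collocations:
        # Species whose genus is the stored collocate, plus its listed synonyms.
        substitutes = [species for species, genus in is_a.items() if genus == dependent]
        substitutes += synonyms.get(dependent, [])
        for substitute in substitutes:
            candidate = (head, substitute)
            if candidate not in collocations:
                generated.add(candidate)
    return generated


stored = {("bunch", "flowers")}
is_a = {"asters": "flowers"}   # asters IS_A flowers
synonyms = {}                  # no synonym data in this toy example

inferred = generate_missing(stored, is_a, synonyms)
# Generated items are labelled so that they can be shown differently from stored ones.
delivery = [(c, "stored") for c in sorted(stored)] + [(c, "generated") for c in sorted(inferred)]
print(delivery)
```

Keeping the generated pairs apart from the stored ones reflects the point above that their correctness is not guaranteed.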

21 Other linguistic resources
Literal paronyms: кадка: кака, каска, качка, кашка, кладка
Morphemic paronyms (бег is the common stem): бегающий, беглый, беговой, бегучий,...
Morphological paradigms for nearly all inflected keywords
English translations of Russian vocabulary entries, which, taken together, form a separate dictionary for accessing CrossLexica's resources

22 Tags of idiomaticity at separate entries or collocations
Each tag has its own symbol on the screen.
no tag: direct meaning only (идти в школу 'to go to school')
(idiom): idiomatic (figurative) meaning only (сесть в галошу 'to get into a fix', lit. 'to sit down into a galosh')
(mb idiom): direct or idiomatic meaning (сесть в лужу, lit. 'to sit down in a puddle' and also 'to get into a mess'; первая ракетка, lit. 'the first racket' and also 'tennis champion')

23 Tags of colloquialism level (style) at separate words or collocations
Each tag has its own symbol on the screen.
no tag: It's a common word / collocation; use it without any restrictions (стена 'wall', окно 'window', книга 'book', налоги 'taxes'...)
It's a special, bookish, or obsolete word / collocation; use it if you are not afraid of being unclear (абсцесс 'abscess', парадигма 'paradigm', адъективный 'adjectival', аутсорсинг 'outsourcing', роуминг 'roaming'...)
It's a purely colloquial word / collocation; don't use it in official documents (мотать нервы 'to squander the nerves', жевать сопли 'to chew one's snot'...)
It's an abusive word / collocation; don't use it in the presence of ladies and children, or in an official environment (говно 'shit', жопа 'ass', мудак 'asshole', взять за яйца 'to catch hold of the nuts'...)
It's common in speech, but scholars do not recommend it, so it should be reworded (оплатить за проезд, проплатить операцию)

24 User's options
One of the following two options can be chosen for the whole dictionary:
Russian, with menu items, names of delivery sections, and glosses of homonym senses given in Russian, or
English, with all of the above given in English.
At runtime, the user can:
Select alphabetic order for delivering some types of collocations, or statistical order (those with more frequent collocates come first). The cut-off level for rarer collocates can be adjusted to prevent novices from drowning in special or rare words.
Forbid the delivery to the screen of abusive, colloquial, or special words with all their collocations.
Enter the query from the keyboard, or select it in the vocabulary list, in the History list, or in the current collocation list on the screen. The last option takes the indicated collocate as a new query, thus beginning navigation through the vocabulary.
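The runtime options listed above (ordering, cut-off level, style filtering) might be combined as in the following sketch. Everything here is assumed for illustration: the `Item` record, the `collocate_freq` numbers, and the style labels are hypothetical and are not taken from CrossLexica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    text: str
    collocate_freq: int  # hypothetical frequency score of the collocate
    style: str = ""      # "", "special", "colloquial" or "abusive"


def deliver(items, order="alphabetic", min_freq=0, forbidden=frozenset()):
    """Filter by cut-off level and style, then sort in the requested order."""
    kept = [it for it in items
            if it.collocate_freq >= min_freq and it.style not in forbidden]
    if order == "statistical":
        # More frequent collocates come first; ties fall back to alphabetic order.
        kept.sort(key=lambda it: (-it.collocate_freq, it.text))
    else:
        kept.sort(key=lambda it: it.text)
    return [it.text for it in kept]


items = [
    Item("ходить в школу", 900),
    Item("ходить в церковь", 300),
    Item("ходить в цирк", 50),
    Item("мотать нервы", 200, style="colloquial"),
]
print(deliver(items))
print(deliver(items, order="statistical", min_freq=100, forbidden={"colloquial"}))
```

Sorting by code point is only an approximation of Russian alphabetic order (the letter ё, for instance, would need a proper collation).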

25 Two types of possible applications
Interactive applications: the user queries the dictionary in interactive mode and can use the results, e.g., for parallel text editing or language learning.
Non-interactive applications: an external program consults the dictionary and uses the results for its own purposes.

26 Interactive application 1: Perfecting Russian speakers' skills
Reference to the collocation ходить в школу 'to go to school' (two possible ways)
Query 1: ходить 'to go'
In the delivery: ... HAS GOVERNING PATTERNS: ... ходить в кого / во что / куда? ... ходить в университет, ходить в учреждения, ходить в храм, ходить в церковь, ходить в цирк, ходить в школу
Query 2: школа 'school'
In the delivery: ... GOVERNED BY VERBS: ... руководить школой, создать... при школе, уйти из школы, уходить из школы, учиться в школе, ходить в школу, шефствовать над школой, являться школой...

27 Interactive application 1: Perfecting Russian speakers' skills
What valencies does the verb забыть 'to forget' have?
забыть что / кого? 'what / whom?': забыть адрес, багаж, вкус, времена, время, вчерашнее,... (101 col.)
забыть о чем / о ком? 'about what / about whom?': забыть о времени, обо всем, о вчерашнем, о главном,... (37 col.)
забыть про что / про кого? 'about what / about whom?': забыть про все, про главное, про детей, про диссертацию, про семью,... (22 col.)
забыть … в чем / в ком / где? '… in what / where?': забыть... в вагоне, в гостях, в комнате, в кафе, в ресторане,... (9 col.)
забыть … на чем / на ком / где? '… on what / where?': забыть... на диване, на кресле, на кровати, на окне,... (7 col.)
забыть … при чем / при ком? '… while what?': забыть... при декларировании, при зачтении,... (3 col.)
забыть … по чему / по кому? '… because of what / why?': забыть... по рассеянности, по невнимательности (2 col.)
забыть … из-за чего / из-за кого / почему? '… because of what / why?': забыть... из-за волнения, из-за спешки (2 col.)
забыть … за чем / за кем? '… behind what / because of what?': забыть... за давностью (1 col.)
забыть … от чего / от кого / откуда? '… because of what / from where?': забыть... от волнения (1 col.)

28 Interactive application 1: Perfecting Russian speakers' skills
More examples
How can плата за проезд 'fare payment' be expressed with a verb? платить / оплатить / оплачивать проезд or заплатить за проезд (проплатить проезд and оплатить за проезд are also included but marked as not recommended)
How can бразильские женщины 'Brazilian women' be reworded? бразильянки. And иракские женщины 'Iraqi women'? Only this way! (But иракец 'Iraqi man' and иракцы 'Iraqi men' do exist!)
How can somebody «cause» иск 'suit'? внести / возбудить / вчинить / подать / предъявить иск, as well as обратиться с иском
What does the abbreviation РФФИ mean? It has two senses:
- Российский фонд федерального имущества
- Российский фонд фундаментальных исследований

29 Interactive application 1: Perfecting Russian speakers' skills
Distinguishing morphemic paronyms
вероятный 'probable' IS MODIFIER FOR: адрес 'address', альтернатива 'alternative', вариант 'option', версия 'version', визит 'visit', встреча 'meeting', выбор 'choice', гипотеза 'hypothesis', запасы 'stocks', изменение 'change'...
вероятностный 'probabilistic' IS MODIFIER FOR: автомат 'automaton', алгоритм 'algorithm', анализ 'analysis', анализатор 'analyzer', аспекты 'aspects', вывод 'inference', задача 'task', идеи 'ideas', контроль 'control'...

30 Interactive application 1: Perfecting Russian speakers' skills
Word sense disambiguation
доменный 1 'domain' (in attr. use) IS MODIFIER FOR: адрес 'address', аукцион 'auction', беспредел 'violations', бизнес 'business', границы 'borders', зона 'zone', имена 'names', карта 'map', контроллер 'controller', новости 'news', протокол 'protocol', регистрация 'registration'...
доменный 2 'blast furnace' (in attr. use) IS MODIFIER FOR: воздухонагреватель 'air heater', газы 'gases', кокс 'coke', конструкция 'construction', мастера 'masters', печи 'furnaces', подъемник 'elevator', производство 'production', процесс 'process', стенки 'sidewalls'...

31 Interactive application 1: Perfecting Russian speakers' skills
Disambiguation of quasi-homonyms
личный 'personal' IS MODIFIER FOR: автомашина 'car', автомобиль 'automobile', автотранспорт 'motor transport', адъютант 'adjutant', амбиции 'ambitions', антипатии 'antipathies', архив 'archive', аспект 'aspect', багаж 'luggage', безопасность 'security', беседа 'conversation', библиотека 'library'...
личной 'face/facial' IS MODIFIER FOR: карман 'pocket', крем 'cream', напильник '(smooth) file', нашивки 'chevrons', полотенце 'towel', пуговицы 'buttons', салфетка 'napkin', сторона 'side'...

32 Interactive application 2: Advice for an advanced learner of Russian
All queries typical for a Russian user seem valuable for a foreigner, plus:
Getting references on the orthography and morphology of any word. E.g., the noun Христос 'Christ' has its own declension pattern (Христа, Христу, Христом...)
Access through the English dictionary. E.g., for the verb pay as many as 11 Russian verbs are returned: обращать, обратить, окупать, окупить, оплатить, оплачивать, платить, уделить, уделять, уплатить, уплачивать; through each of them the relevant information can be obtained.

33 Non-interactive application 1: Facilitating text parsing
Вышеупомянутые механизмы обрушения вызваны перегрузкой антресолей в процессе эксплуатации. ('The above-mentioned collapse mechanisms are caused by overloading of the mezzanines during operation.')
Collocations are searched for in the sentence to be parsed; the greater the number of correct collocations found in a given parsing variant, the more probable that variant is.
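The ranking idea stated on this slide can be sketched as below. The slide calls this application "only an idea", so this is purely illustrative: the two parsing variants, the lemma pairs, and the dictionary contents are all hypothetical.

```python
def collocation_score(links, dictionary):
    """Count the syntactic links of one parsing variant that are known collocations."""
    return sum(1 for link in links if link in dictionary)


def rank_variants(variants, dictionary):
    """Order variant names from most to least supported by the dictionary."""
    return sorted(variants,
                  key=lambda name: collocation_score(variants[name], dictionary),
                  reverse=True)


# Hypothetical dictionary content, stored as (head lemma, dependent lemma) pairs.
dictionary = {
    ("механизм", "обрушение"),
    ("вызвать", "перегрузка"),
    ("перегрузка", "антресоли"),
    ("процесс", "эксплуатация"),
}

# Two hypothetical parsing variants of the example sentence.
variants = {
    "A": [("механизм", "обрушение"), ("вызвать", "перегрузка"),
          ("перегрузка", "антресоли"), ("процесс", "эксплуатация")],
    "B": [("механизм", "обрушение"), ("обрушение", "перегрузка"),
          ("антресоли", "процесс"), ("процесс", "эксплуатация")],
}
print(rank_variants(variants, dictionary))
```

A real parser would combine such a score with its own grammar-based preferences; here the collocation count is the only criterion.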

34 Non-interactive application 2: Word sense disambiguation
служба 1 'service' vs. служба 2 'servicing'
Хамовнический суд... начнет рассмотрение иска управления Федеральной службы 1 исполнения наказаний России по Москве к адвокатам осужденных. (Many compatible neighbors)
Хамовнический суд... начнет рассмотрение иска управления Федеральной службы 2 исполнения наказаний России по Москве к адвокатам осужденных. (Few compatible neighbors)
'The Khamovniki court … will start to consider the suit of the Administration of the Russian Federal Penitentiary Service/Servicing in/for Moscow against the convicts' attorneys.'
Collocations and semantic links compatible with the various senses of a homonymous word are searched for. The sense with the greater number of syntagmatically or semantically compatible neighbors is preferred.
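The preference rule can be sketched with toy data. For the compatible words, the sketch borrows part of the modifier lists that the presentation gives for доменный 1 'domain' and доменный 2 'blast furnace'; the context words and the function name `prefer_sense` are assumptions made for the example.

```python
# Partial modifier lists for the two senses, taken from the доменный example.
compatible = {
    "доменный 1": {"адрес", "зона", "имена", "протокол", "регистрация"},
    "доменный 2": {"газы", "кокс", "печи", "производство", "процесс"},
}


def prefer_sense(senses, neighbors):
    """Pick the sense with the greatest number of compatible neighbors."""
    scores = {sense: len(words & neighbors) for sense, words in senses.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical lemmatized neighbors of the ambiguous word in some sentence.
best, scores = prefer_sense(compatible, {"печи", "кокс", "новый"})
print(best, scores)
```

Only syntagmatic compatibility is counted here; the slide also allows semantically compatible neighbors, which would add further sets to intersect.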

35 Non-interactive application 3: Detection and correction of malapropisms
... посещение истерического центра Москвы... '… visiting the hysterical center of Moscow …'
Paronyms of истерического: истеричного 'hysteric', исторического 'historical', стерического 'sterical'
Paronyms of центра: центризма 'centrism', цента 'cent'
Syntactically linked pairs that are not correct collocations are detected. For each word in such a pair, all its paronyms are searched through, as well as all collocations available for them in the dictionary. The collocations found are proposed to the user for verification.
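A minimal sketch of this procedure follows, using the paronyms from the slide in lemma form. The set of known collocations is a one-item assumption, the function name `suggest` is hypothetical, and only one word of the pair is replaced at a time.

```python
# Assumed dictionary content (lemmas): only the correct collocation is known.
collocations = {("исторический", "центр")}

# Paronyms listed on the slide, given here as lemmas.
paronyms = {
    "истерический": ["истеричный", "исторический", "стерический"],
    "центр": ["центризм", "цент"],
}


def suggest(pair):
    """Return known collocations reachable by swapping one word for a paronym."""
    if pair in collocations:
        return []  # a correct collocation: nothing to report
    modifier, noun = pair
    candidates = [(p, noun) for p in paronyms.get(modifier, [])]
    candidates += [(modifier, p) for p in paronyms.get(noun, [])]
    return [c for c in candidates if c in collocations]


print(suggest(("истерический", "центр")))
```

The suggestions are only proposed, not applied, which matches the slide's point that the user verifies them.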

36 Non-interactive application 4: Steganography and steganalysis (electronic watermarking problem) - explanation
Collocations and synonyms of the words occurring in a text are used for a controlled replacement of some words by their synonyms. These replacements encode independent information, which is thus transferred covertly by the carrier text without any alteration of its meaning.

37 Non-interactive application 4: Steganography and steganalysis - example
Пять {землетрясений | подземных толчков} {замечено | зарегистрировано | зафиксировано | отмечено} {за 24 часа | за сутки} на юге {Алтая | Республики Алтай}. {Магнитуда | Мощность | Мощь | Сила} {землетрясений | подземных толчков} составляла от 2,2 до 3,1 балла по шкале Рихтера, сообщили на Акташской {сейсмической станции | сейсмостанции} сегодня {после полудня | во вторую половину дня}.
(Braces enclose the synonyms available at each position. Other labels in the slide diagram: B Obama, H Clinton.)
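One simple way to hide bits in synonym choices is sketched below. It is not the author's algorithm (described elsewhere in the presentation only as a draft): the order of the slots is assumed, each slot with 2^k usable options is assumed to carry k bits, and the collocation check that would keep every substitution idiomatic is omitted.

```python
# Synonym groups from the example text; the order of the slots is an assumption.
slots = [
    ["землетрясений", "подземных толчков"],
    ["замечено", "зарегистрировано", "зафиксировано", "отмечено"],
    ["за 24 часа", "за сутки"],
    ["Алтая", "Республики Алтай"],
    ["Магнитуда", "Мощность", "Мощь", "Сила"],
    ["землетрясений", "подземных толчков"],
    ["сейсмической станции", "сейсмостанции"],
    ["после полудня", "во вторую половину дня"],
]


def capacity(options):
    """Number of whole bits one slot can carry: floor(log2(len(options)))."""
    return len(options).bit_length() - 1


def embed(slots, bits):
    """Choose one synonym per slot so that the choices spell out the bit string."""
    chosen, pos = [], 0
    for options in slots:
        k = capacity(options)
        chunk = bits[pos:pos + k].ljust(k, "0")  # pad with zeros when bits run out
        chosen.append(options[int(chunk, 2) if k else 0])
        pos += k
    return chosen


def extract(slots, chosen):
    """Recover the bit string from the synonyms actually used."""
    out = []
    for options, word in zip(slots, chosen):
        k = capacity(options)
        if k:
            out.append(format(options.index(word), f"0{k}b"))
    return "".join(out)


message = "1011010011"  # ten bits: the total capacity of the eight slots
chosen = embed(slots, message)
print(chosen)
print(extract(slots, chosen))
```

The receiver needs the same synonym groups in the same order to decode, so in this sketch the dictionary itself acts as the shared key.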

38 Other non-interactive applications
Advanced information search: an Internet query is automatically enriched with some collocates of the query words.
Idiomatic translation of English collocations into Russian ones: in answer to the query strong woman, CrossLexica currently outputs крепкая баба, сильная женщина...
Automatic splitting of text into paragraphs: the boundaries between sentences that are crossed by the minimal number of links of any type are searched for.
etc.
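The paragraph-splitting criterion can be sketched as follows, assuming the links between sentences have already been found and are given as pairs of sentence indices. The function names and the toy link list are hypothetical.

```python
def crossing_counts(n_sentences, links):
    """For each boundary between adjacent sentences, count the links crossing it.

    Boundary b lies between sentence b and sentence b + 1.
    """
    counts = [0] * (n_sentences - 1)
    for i, j in links:
        lo, hi = sorted((i, j))
        for boundary in range(lo, hi):
            counts[boundary] += 1
    return counts


def weakest_boundary(n_sentences, links):
    """Index of the boundary crossed by the fewest links (first one on ties)."""
    counts = crossing_counts(n_sentences, links)
    return min(range(len(counts)), key=counts.__getitem__)


# Toy input: four sentences; each pair names two sentences joined by some link.
links = [(0, 1), (0, 1), (2, 3), (2, 3), (1, 2)]
print(crossing_counts(4, links))
print(weakest_boundary(4, links))
```

Here sentences 0-1 and 2-3 are tightly linked, so the paragraph break is proposed between sentences 1 and 2.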

39 Example of delivery for чувство 'emotion' (screenshot; visible section label: Translations)

40 Instead of conclusion: Opinion of Prof. Igor Melčuk, Canada CrossLexica is unique in its genre. As far as I know, no similar dictionary exists for any language. A few published dictionaries of collocations (English and French) cannot even be compared with CrossLexica as far as the number of phrases described, the wealth of lexicographic information supplied, and the logic of dictionary organization. The whole text of the letter of evaluation is presented separately.

41 My first dream is
To supply CrossLexica to several user groups
USER GROUP (ESTIMATE)
Any Russian who uses a PC, mobile phone, or Internet terminal (officials, businesspersons, scientists, students, etc.) [There are 50 million Internet users and more than 60 million mobile phones in Russia now.] (up to 1 million)
A resident of a country near Russia (Ukraine, Baltic States, Poland, Central Asian countries, etc.) wishing to restore or acquire knowledge of modern Russian (businesspersons, potential migrants or students, etc.) [Russia now accepts about 7 million migrants. More than 20 million Russian speakers live outside Russia.] (up to 100,000)
A resident of a Western country (USA, UK, Canada, France, Germany, Italy, Spain, Scandinavian States, etc.) who already knows some Russian but wishes to improve their skills (businesspersons, Russian émigrés, teachers of Russian, Slavists, etc.) [More than 300,000 specialists have left the ex-USSR for Western countries since 1991.] (up to 10,000)

42 My second dream is
To see non-interactive applications working
Detection and correction of malapropisms: an algorithm exists and has been tested in limited experiments
Word sense disambiguation: experiments are under preparation
Steganography and steganalysis: a draft algorithm exists
Facilitating text parsing: only an idea
Advanced information search: only an idea
To implement all these tasks in parallel with bringing CrossLexica to perfection is already impossible for me. However, I am ready to give suggestions and consultations.

43 My third dream is
To see CrossLexica implemented for English
The structure of CrossLexica-Eng can be basically the same (the differences are rather clear to me).
Russian collocations can be translated word for word in multiple ways, and then the trash should be filtered out automatically through Web search engines and manually by lexicographers.
As an initial supply, the collocations from the available academic dictionaries can be taken.
The work group for the task had better be headed by a native English speaker and should include English lexicographers.
My estimate of the effort is 20 person-years over no less than 3 years.
My experience is valuable, but only for consulting.

44 Thus: The World's Largest Combinatorial Dictionary
© Database Content, Grammar: I. Bolshakov, 2009
© Uploading Utilities: I. Bolshakov, A. Gelbukh, 2009
© User Interface: A. Gelbukh, 2009

45 Thank you for your attention! Any questions? Want to see CrossLexica functioning? Prof. Igor A. Bolshakov
