# FROM “Meaning”s TO Words İlknur DURGAR EL-KAHLOUT.


FROM “Meaning”s TO Words İlknur DURGAR EL-KAHLOUT

Problem
- For a given definition, find the appropriate word (or words) that has a similar definition
  - A traditional dictionary is of no use here

Examples
- Akımı ölçmek için kullanılan alet → akımölçer (A device that is used to measure the current → ammeter)
- akımölçer: elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre (ammeter: a device that measures the intensity of electrical current, amperemeter)

Examples
- Çalıştığı işten kendi isteği ile ayrılmak → istifa (Leaving one's job voluntarily → resignation)
- istifa: kendi isteği ile görevden ayrılma (resignation: leaving a position voluntarily)

Applications
- Computer-assisted language learning
- Solving crossword puzzles
- Reverse dictionary

Outline
- Problem Statement
- Challenges
- Our Approach
- Methods
- Results
- Result Summary
- Conclusion

Problem Statement
- For example, a user knows the meaning of the word akımölçer (ammeter) and describes it as: Akımı ölçmek için kullanılan alet (A device that is used to measure the current)
- However, the actual definition of the word in the dictionary is: elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre (a device that measures the intensity of electrical current, amperemeter)

Problem Statement
- Find the similarity between two definitions:
  - Akımı ölçmek için kullanılan alet (A device that is used to measure the current)
  - elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre (a device that measures the intensity of electrical current, amperemeter)

Meaning-to-Word (MTW)
- The Meaning-to-Word System (MTW) attacks the problem of finding the appropriate word (or words) whose meaning “matches” the given definition

Challenges
- Two challenging problems:
  - Finding words whose definitions are "similar" to the query in some sense
  - Ranking the candidate words in a variety of ways

Information flow in MTW
User Definition → (query) → Search in Dictionary → (candidates) → Rank Candidates → List of words

Meanings To Words (MTW)
- The problem of retrieving words from their "meaning"s seems, at first sight, to be an information retrieval problem

Information Retrieval (IR)
- Responds to the user's query by selecting documents from a database and ranking them in terms of relevance
- Uses (mostly) statistical and symbolic techniques to retrieve documents for a given query, employing shallow natural language analysis

Similarities between MTW and IR
- Goals:
  - Select relevant items from a collection based on a query
- Collections:
  - Collection ↔ Dictionary
- Documents:
  - Documents ↔ Definitions

Similarities between MTW and IR
- Approaches:
  - Compare the user request with each item in the collection
- Ranking:
  - The most important task in both
  - But the ranking strategies are different

Differences between IR and MTW
- Expected results:
  - Many relevant documents vs. only one correct word
- Query expression:
  - Keywords vs. sentences (or phrases)
- Space size:
  - Long documents (avg. 300-400 words) vs. one-sentence definitions (avg. 10-20 words)
  - Huge collection (10^6 to 10^9 documents) vs. medium dictionary (10^5 word definitions)

Available Resources
- Turkish Dictionary
- Turkish Wordnet

Normalization
- Tokenization:
  - All inter-word (non-word, non-digit) symbols, such as punctuation, are eliminated
  - Each word is a term
- Stemming:
  - Words with the same stem but different affixes are reduced to that stem
  - Enables matching different morphological variants of the original definition's words
- Stop word elimination:
  - Stop words have little or no meaning
  - Frequency-based (very frequent words)
  - Linguistic (determiners, prepositions, pronouns, ...)
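The three normalization steps can be sketched in a few lines of Python. This is only an illustration: the stop-word set and the suffix list below are hypothetical stand-ins chosen to handle one example query, and a real system for Turkish would rely on a proper morphological analyzer rather than suffix stripping.

```python
import re

# Hypothetical stop words and suffixes, chosen only to illustrate the steps.
STOP_WORDS = {"için", "ve", "bir", "bu"}
SUFFIXES = ["mek", "mak", "ılan", "ilen", "ı", "i"]


def tokenize(text):
    """Lowercase the text and keep runs of word characters as terms."""
    return re.findall(r"\w+", text.lower())


def toy_stem(word):
    """Strip the first matching suffix (toy stand-in for morphological analysis)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def normalize(definition):
    """Tokenize, drop stop words, then stem the remaining terms."""
    terms = tokenize(definition)
    return [toy_stem(term) for term in terms if term not in STOP_WORDS]


print(normalize("Akımı ölçmek için kullanılan alet."))
# ['akım', 'ölç', 'kullan', 'alet']
```

Note that this toy stemmer returns a single stem per word, whereas the stem match slides consider several candidate stems for the same word.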

Query Processing
- Subset generation:
  - Search with different sets of words
  - Select informative words from the user's query
  - Query: hiç evlenmemiş kişi (a person who has never been married)
    - {önce, evlenmemiş, kişi} (before, unmarried, person)
    - {evlenmemiş, kişi} (unmarried, person), {önce, kişi} (before, person), {önce, evlenmemiş} (before, unmarried)
    - {evlenmemiş} (unmarried), {önce} (before), {kişi} (person)

Query Processing
- Subset sorting:
  - An unordered list of subsets is insufficient
  - Top-down sorting: rank the generated subsets
    1. By the number of words, e.g. {önce, evlenmemiş, kişi} (before, unmarried, person) vs. {evlenmemiş, kişi} (unmarried, person)
    2. By the sum of the logarithms of the word frequencies, e.g. {evlenmemiş, kişi} (unmarried, person) vs. {önce, kişi} (before, person)
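Subset generation and the two-level sort could look roughly like the sketch below. The word frequencies are invented for the example, and the direction of the second criterion is an assumption: the slides only name "sum of frequency logarithm", so here subsets of rarer (presumably more informative) words are placed first.

```python
import math
from itertools import combinations


def generate_subsets(terms):
    """Return every non-empty subset of the query terms as a tuple."""
    return [
        subset
        for size in range(len(terms), 0, -1)
        for subset in combinations(terms, size)
    ]


def sort_subsets(subsets, frequency):
    """Larger subsets first; ties broken by the sum of log frequencies.

    Assumption: a lower sum (rarer words) ranks higher.
    """
    def key(subset):
        return (-len(subset), sum(math.log(frequency[t]) for t in subset))

    return sorted(subsets, key=key)


# Hypothetical corpus frequencies for the three query words.
frequency = {"önce": 900, "evlenmemiş": 3, "kişi": 400}

ranked = sort_subsets(generate_subsets(["önce", "evlenmemiş", "kişi"]), frequency)
for subset in ranked:
    print(subset)
```

With these made-up frequencies the full three-word subset comes first, {evlenmemiş, kişi} precedes {önce, kişi}, and the single most frequent word comes last.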


Searching for “Meaning”s
- Two methods:
  - Stem Match
  - Query Expansion

Stem Match
- Morphological normalization of words
  - Find meanings that contain morphological variants of the original definition's words

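Stem Match (Ex.)
Query: akımı ölçmek için kullanılan alet (A device that is used to measure the current)

| Query word | Candidate stems |
|---|---|
| akımı | ak (white), akım (current), akı (flux) |
| ölçmek | ölç (measure) |
| için | için (to), iç (drink) |
| kullanılan | kullan (use), kul (slave) |
| alet | alet (device) |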

Stem Match
- Query: akımı ölçmek için kullanılan alet (A device that is used to measure the current)
- Dictionary definition: elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre (a device that measures the intensity of electrical current, amperemeter)
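A minimal sketch of the stem match search, assuming each query word has already been reduced to a set of candidate stems and each dictionary definition to a set of stems. The function name `stem_match` and the hand-stemmed toy dictionary are illustrative assumptions, not the system's actual data structures.

```python
def stem_match(query_stems, dictionary):
    """Return headwords whose definition contains a stem of every query word.

    query_stems: one set of alternative stems per query word.
    dictionary:  headword -> set of stems found in its definition.
    """
    return [
        headword
        for headword, definition_stems in dictionary.items()
        if all(alternatives & definition_stems for alternatives in query_stems)
    ]


# Hand-stemmed toy dictionary built from the two entries quoted on the slides.
dictionary = {
    "akımölçer": {"elektrik", "akım", "şiddet", "ölç", "yara", "araç", "ampermetre"},
    "istifa": {"kendi", "istek", "ile", "görev", "ayrıl"},
}

full_query = [{"ak", "akım", "akı"}, {"ölç"}, {"kullan", "kul"}, {"alet"}]
sub_query = [{"ak", "akım", "akı"}, {"ölç"}]

print(stem_match(full_query, dictionary))  # []
print(stem_match(sub_query, dictionary))   # ['akımölçer']
```

The full query finds nothing because the dictionary definition says araç where the user said alet, while the two-word subset does retrieve akımölçer. This is why the system searches with subsets of the query.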

Stem Match
- Drawbacks:
  - Conflates words with very different meanings to the same stem, e.g. yüksek (high) → yüksek (high), yük (load); ilim (science, my city), ilde (in the city) → il (city)
  - Cannot find relations between similar words, e.g. kimse (someone) and kişi (person), bölüm (part) and kısım (portion)

Query Expansion
- The users of retrieval systems often use different words to describe the concepts in their queries than the authors use to describe the same concepts in their documents.
- In experiments, two people use the same term to describe an object less than 20% of the time (Furnas 1987).

Using Query Expansion
- Two different approaches:
  - Expand the query with relations (synonyms, specializations, generalizations)
  - Expand the query with the relevant answers to the unexpanded query
- The synonym relation is used in MTW
  - Ex: {besin, gıda} (food, nourishment), {iyileş, düzel} (to get better) / {iyileş, geliş} (to improve)

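Query Expansion (Ex.)
Query: akımı ölçmek için kullanılan alet (A device that is used to measure the current)

| Query word | Candidate stems | Synonyms added |
|---|---|---|
| akımı | ak (white), akım (current), akı (flux) | ak: beyaz; akı: debi, akış |
| ölçmek | ölç (measure) | ölç: ölçüm |
| için | için (to), iç (drink) | - |
| kullanılan | kullan (use), kul (slave) | kullan: faydalan, yararlan; kul: köle |
| alet | alet (device) | alet: araç, gereç |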

Query Expansion (Ex.)
- Query: akımı ölçmek için kullanılan alet (A device that is used to measure the current)
- Dictionary definition: elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre (a device that measures the intensity of electrical current, amperemeter)
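Synonym-based expansion can be sketched as follows. The synonym sets are the ones shown in the example; the `expand` and `covers` helpers and the hand-stemmed definition are illustrative assumptions, and a real system would presumably read the synonyms from a wordnet rather than a literal table.

```python
# Synonym sets taken from the example; stored here as a plain dict for illustration.
SYNONYMS = {
    "ak": {"beyaz"},
    "akı": {"debi", "akış"},
    "ölç": {"ölçüm"},
    "kullan": {"faydalan", "yararlan"},
    "kul": {"köle"},
    "alet": {"araç", "gereç"},
}


def expand(query_stems, synonyms):
    """Add the synonyms of every alternative stem to that word's alternatives."""
    expanded = []
    for alternatives in query_stems:
        grown = set(alternatives)
        for stem in alternatives:
            grown |= synonyms.get(stem, set())
        expanded.append(grown)
    return expanded


def covers(query_stems, definition_stems):
    """True if the definition contains at least one alternative for every query word."""
    return all(alternatives & definition_stems for alternatives in query_stems)


query = [{"ak", "akım", "akı"}, {"ölç"}, {"alet"}]
definition = {"elektrik", "akım", "şiddet", "ölç", "yara", "araç", "ampermetre"}

print(covers(query, definition))                    # False: "alet" is absent
print(covers(expand(query, SYNONYMS), definition))  # True: "araç" stands in for "alet"
```

Expansion lets the query word alet match the dictionary's araç, which plain stem matching cannot do.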


Ranking
- The main goal of a retrieval system is to find the documents that are relevant to a query.
- Documents that are likely to be more relevant should be ranked at the top and documents that are likely to be less relevant should be ranked at the bottom of the ranked list. (Hiemstra 1999)

Ranking
- The most important part of MTW
  - Having the right answer in the retrieved set is not enough
  - The aim is to have the right answer near the top of the retrieved set (e.g. within the top 50 answers)

Ranking
- Simple but effective methods:
  - Subset informativeness (subset sorting)
  - Number of matched words (subset sorting)
  - Length of the candidate definition
  - Longest Common Subsequence
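Two of these criteria, longest common subsequence and definition length, can be illustrated with the sketch below. The slides do not say how the criteria are combined, so the ordering used here (higher LCS first, then shorter definitions) is an assumption, as are the helper names and the hand-stemmed candidates.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rank(query, candidates):
    """Sort headwords by LCS with the query (descending), then by definition length."""
    return sorted(
        candidates,
        key=lambda word: (-lcs_length(query, candidates[word]), len(candidates[word])),
    )


query = ["akım", "ölç", "kullan", "alet"]
candidates = {
    "istifa": ["kendi", "istek", "ile", "görev", "ayrıl"],
    "akımölçer": ["elektrik", "akım", "şiddet", "ölç", "yara", "araç", "ampermetre"],
}

print(rank(query, candidates))  # ['akımölçer', 'istifa']
```

Here akımölçer shares the ordered pair akım, ölç with the query (LCS 2), so it outranks istifa (LCS 0) even though its definition is longer.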

Some statistics
- Train sets:
  - 50 queries from real users
  - 50 queries from a dictionary
- Test sets:
  - 50 queries from real users
  - 50 queries from a dictionary

| Statistic | Test set 1 | Train set 1 | Test set 2 | Train set 2 |
|---|---|---|---|---|
| # of queries | 50 | 50 | 50 | 50 |
| Avg. # of query words | 5.66 | 4.64 | 9.24 | 13.98 |
| Max. # of query words | 17 | 12 | 23 | 45 |
| Min. # of query words | 2 | 1 | 1 | 6 |

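Stem Match (all stems included)

| Rank | Test set 1 | Train set 1 | Test set 2 | Train set 2 |
|---|---|---|---|---|
| 1-10 | 13 (26%) | 18 (36%) | 45 (90%) | 41 (82%) |
| 11-50 | 7 (14%) | 12 (24%) | 2 (4%) | 5 (10%) |
| 51-100 | 4 (8%) | 1 (2%) | 1 (2%) | 2 (4%) |
| 101-300 | 3 (6%) | 3 (6%) | 2 (4%) | 1 (2%) |
| 301-500 | 2 (4%) | 2 (4%) | 0 (0%) | 1 (2%) |
| 501-1000 | 6 (12%) | 2 (4%) | 0 (0%) | 0 (0%) |
| Over 1000 | 4 (8%) | 2 (4%) | 0 (0%) | 0 (0%) |
| Not found | 11 (22%) | 10 (20%) | 0 (0%) | 0 (0%) |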

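Stem Match (longest stem included)

| Rank | Test set 1 | Train set 1 | Test set 2 | Train set 2 |
|---|---|---|---|---|
| 1-10 | 14 (28%) | 21 (42%) | 46 (92%) | 43 (86%) |
| 11-50 | 5 (10%) | 9 (18%) | 1 (2%) | 5 (10%) |
| 51-100 | 4 (8%) | 1 (2%) | 1 (2%) | 1 (2%) |
| 101-300 | 3 (6%) | 1 (2%) | 2 (4%) | 1 (2%) |
| 301-500 | 2 (4%) | 3 (6%) | 0 (0%) | 0 (0%) |
| 501-1000 | 5 (10%) | 2 (4%) | 0 (0%) | 0 (0%) |
| Over 1000 | 4 (8%) | 2 (4%) | 0 (0%) | 0 (0%) |
| Not found | 13 (26%) | 11 (22%) | 0 (0%) | 0 (0%) |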

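Query Expansion Match (all stems included)

| Rank | Test set 1 | Train set 1 | Test set 2 | Train set 2 |
|---|---|---|---|---|
| 1-10 | 14 (28%) | 24 (48%) | 45 (90%) | 41 (82%) |
| 11-50 | 9 (18%) | 9 (18%) | 2 (4%) | 5 (10%) |
| 51-100 | 3 (6%) | 3 (6%) | 1 (2%) | 2 (4%) |
| 101-300 | 7 (14%) | 2 (4%) | 2 (4%) | 1 (2%) |
| 301-500 | 0 (0%) | 1 (2%) | 0 (0%) | 1 (2%) |
| 501-1000 | 4 (8%) | 5 (10%) | 0 (0%) | 0 (0%) |
| Over 1000 | 4 (8%) | 1 (2%) | 0 (0%) | 0 (0%) |
| Not found | 9 (18%) | 5 (10%) | 0 (0%) | 0 (0%) |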

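Query Expansion Match (longest stem included)

| Rank | Test set 1 | Train set 1 | Test set 2 | Train set 2 |
|---|---|---|---|---|
| 1-10 | 14 (28%) | 24 (48%) | 41 (82%) | 39 (78%) |
| 11-50 | 6 (12%) | 8 (16%) | 5 (10%) | 6 (12%) |
| 51-100 | 5 (10%) | 5 (10%) | 0 (0%) | 2 (4%) |
| 101-300 | 7 (14%) | 2 (4%) | 0 (0%) | 2 (4%) |
| 301-500 | 1 (2%) | 1 (2%) | 0 (0%) | 0 (0%) |
| 501-1000 | 5 (10%) | 3 (6%) | 0 (0%) | 0 (0%) |
| Over 1000 | 3 (6%) | 2 (4%) | 1 (2%) | 1 (2%) |
| Not found | 9 (18%) | 5 (10%) | 0 (0%) | 0 (0%) |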

Data fusion
- No single method is better than all others in all cases
- Merging results from different methods seems to be a promising approach for achieving improved performance
- Many data fusion methods exist, including min, max, average, sum, weighted average and other linear combination functions

Data Fusion
- Weighted sum

Data Fusion
- c1 = 0.7 (stem match constant)
- c2 = 0.3 (query expansion constant)

| Rank | Test set 1 | Train set 1 |
|---|---|---|
| 1-10 | 15 (30%) | 22 (44%) |
| 11-50 | 10 (20%) | 14 (28%) |
| 51-100 | 4 (8%) | 1 (2%) |
| 101-300 | 3 (6%) | 2 (4%) |
| 301-500 | 3 (6%) | 0 (0%) |
| 501-1000 | 5 (10%) | 3 (6%) |
| Over 1000 | - | - |
| Not found | 11 (22%) | 8 (16%) |
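A weighted-sum fusion with the constants above might be sketched as follows. The slides give only the two constants, so everything else here is an assumption: each method is taken to produce a per-word score in [0, 1], a word missing from one method's list is scored 0 for that method, and the scores themselves are invented.

```python
def fuse(stem_scores, expansion_scores, c1=0.7, c2=0.3):
    """Rank words by c1 * stem-match score + c2 * query-expansion score.

    Assumption: a word absent from one method's results contributes 0 for it.
    """
    words = set(stem_scores) | set(expansion_scores)
    fused = {
        word: c1 * stem_scores.get(word, 0.0) + c2 * expansion_scores.get(word, 0.0)
        for word in words
    }
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(fused, key=lambda word: (-fused[word], word))


# Hypothetical normalized scores from the two search methods.
stem_scores = {"akımölçer": 0.9, "voltmetre": 0.4}
expansion_scores = {"akımölçer": 0.6, "voltmetre": 0.8, "ampermetre": 0.7}

print(fuse(stem_scores, expansion_scores))
# ['akımölçer', 'voltmetre', 'ampermetre']
```

With c1 = 0.7 and c2 = 0.3 the stem match result dominates, and query expansion mainly contributes words that stem match did not retrieve at all.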

Result Summary
- Stem Match (longest stem included)
  - 60% of real user queries
  - 96% of dictionary queries
- Query Expansion (all stems included)
  - 68% of real user queries
  - 92% of dictionary queries
- Data Fusion (longest stem included)
  - 72% of real user queries

Conclusion
- A Meaning-to-Word system was implemented for Turkish
- Results on unseen data are rather satisfactory
- Query expansion performs better, although it cannot find the words for all queries
  - 68% of real user queries and 90% of dictionary queries are found in the first 50 results
- Data fusion performs better still
  - 72% of real user queries are found in the first 50 results

THANK YOU !!