Automatic Term Weighting, Lexical Statistics and …… Quantitative Terminology Kyo Kageura National Institute of Informatics July 05, 2003
Project
To rescue or recover the sphere of lexicology.
To release the richness and productivity of lexico-conceptual sets from the dominance of discourse …… while maintaining a traceable procedure throughout, and starting from textual corpora.
Contents
Sphere of Texts and Sphere of Lexicon/ology
Three (representative) methods of automatic term weighting and their meanings
From corpus-based lexical statistics to (still) corpus-based quantitative lexicology
Measuring lexical productivity in the lexicon (i.e. the lexicological concept of productivity) from textual data, with some experiments
Conclusions
Textual Sphere and Lexicological Sphere
[Diagram: the Textual Sphere and the Lexicological Sphere, labelled with "terms", "complex terms", "lexicon", "lexicology" and "quantitative lexicology"; annotation: "This exists".]
So what about talking about lexicology when talking about corpus-based…
Lexicological Sphere and Texts
Lexicology deals with the actual set of words, which does not mean its natural history.
A lexicological model with expectations addresses the realistic possibility of existence, not permissible forms or a fantasy land; thus actual data are required.
The primary language data are texts.
Thus the recovery of lexicological characteristics from texts becomes the task of lexicology.
Automatic Term Weighting (ATW)
Reviewing some representative ATW methods gives important insights into the current topic, and at the same time into the ATW methods themselves.
We look at:
Tfidf (and its information-theoretic interpretation by Aizawa)
Term representativeness (by Hisamitsu)
Lexical measure (by Nakagawa), which goes from texts to lexicology, almost.
ATW1: tfidf
Document-term matrix:

          d_1    d_2    …    d_D
   t_1
   …                   f_ij
   t_T

Tfidf and many other similar measures, in fact most of those used in IR, are based on the document-term matrix, which has a formal duality. Thus the weight of a term is always and only meaningful vis-à-vis the given set of documents or its population (dfitf thus makes sense, as in the probabilistic model).
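The duality of the document-term matrix can be illustrated with a minimal sketch. It assumes the common tf × log(D/df) variant of tfidf (the slide does not say which variant is meant), and the function names and the toy matrix are hypothetical. Applying the same function to the transposed matrix weights documents relative to the set of terms, which is the "dfitf" direction.

```python
import math

def tfidf(matrix):
    """matrix[i][j] = f_ij, the frequency of row item i in column item j.
    Returns weights of the same shape, using tf * log(n_columns / df).
    Assumption: this is one common tfidf variant, not necessarily the
    one the presentation has in mind."""
    n_cols = len(matrix[0])
    weights = []
    for row in matrix:
        df = sum(1 for f in row if f > 0)  # number of columns the row item occurs in
        idf = math.log(n_cols / df) if df else 0.0
        weights.append([f * idf for f in row])
    return weights

def transpose(matrix):
    """Swap the roles of terms and documents."""
    return [list(col) for col in zip(*matrix)]

# Toy matrix (invented data): 3 terms (rows) x 3 documents (columns).
F = [
    [2, 0, 0],
    [1, 1, 1],
    [0, 3, 1],
]

term_weights = tfidf(F)            # terms weighted against the document set
doc_weights = tfidf(transpose(F))  # dual: documents weighted against the term set
```

A term that occurs in every document (the second row) receives zero weight, which shows that the weight only has meaning relative to the given document set.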
ATW2: Term representativeness
"You shall know the meaning of a word by the company it keeps" (or: look at a person's friends to know the person … if there are any, anyway).
To calculate the weight of a term t_i, take the distribution of the words that accompany t_i within a certain window size, and calculate the distance between this distribution and the distribution of a random chunk of the same size (NB: size normalisation is necessary because of the LNRE nature of language data).
ATW2: Term representativeness
This method discards the factor of dominant versus minor discourse at the level of observed texts (or: it does no favours to people who buy friends at random with money).
It measures the characteristic context that the term t_i can attract, if it appears at all, at the level of discourse (depending, of course, on the nature of the window the method uses).
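The idea of comparing a term's context distribution with a background distribution can be sketched as follows. This is only an illustration under stated assumptions: Kullback-Leibler divergence is used as a stand-in distance, the whole corpus serves as the background instead of a random chunk of matched size, and the size normalisation mentioned above is omitted. The function names and the toy sentence are hypothetical, and Hisamitsu's actual measure is not reproduced here.

```python
import math
from collections import Counter

def context_distribution(tokens, term, window):
    """Count the words occurring within `window` positions of each
    occurrence of `term` (the focal occurrence itself is excluded)."""
    counts = Counter()
    for pos, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            counts.update(t for k, t in enumerate(tokens[lo:hi], start=lo) if k != pos)
    return counts

def kl_divergence(sample, background):
    """KL(sample || background) over relative frequencies.
    Assumption: every word in `sample` also occurs in `background`."""
    n_s = sum(sample.values())
    n_b = sum(background.values())
    return sum((c / n_s) * math.log((c / n_s) / (background[w] / n_b))
               for w, c in sample.items())

# Toy corpus (invented data).
tokens = "the expert system uses a knowledge base and the system learns".split()
context = context_distribution(tokens, "system", window=1)
score = kl_divergence(context, Counter(tokens))
```

A larger score means the words around the term deviate more from the background, independently of how often the term itself is used.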
ATW3: Nakagawa's method
Observe the number of different elements (element types) that accompany t_i within the complex lexical units in texts.
This therefore reflects the lexical productivity of the focal element t_i, but mixed with the degree of its use in discourse (texts).
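Counting the element types that accompany a focal element can be sketched as below. The sketch assumes that "accompany" means "directly adjacent inside a complex term" and that complex terms are already segmented into tuples of elements; the function name and the toy terms are hypothetical, and the way Nakagawa's method turns such counts into a final score is not reproduced.

```python
from collections import defaultdict

def neighbour_types(complex_terms):
    """For every element, collect the distinct elements seen directly to
    its left and directly to its right inside complex terms."""
    left = defaultdict(set)
    right = defaultdict(set)
    for term in complex_terms:
        for a, b in zip(term, term[1:]):
            right[a].add(b)  # b follows a
            left[b].add(a)   # a precedes b
    return left, right

# Toy complex terms (invented data); the repeated term adds no new type.
terms = [
    ("expert", "system"),
    ("operating", "system"),
    ("system", "model"),
    ("expert", "system"),
]
left, right = neighbour_types(terms)
n_left = len(left["system"])    # distinct elements preceding "system"
n_right = len(right["system"])  # distinct elements following "system"
```

Because the counts are taken over a corpus, they grow with how much the element is used in the texts, which is the dependence on discourse noted above.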
ATW to Quantitative Lexicology
To characterise the lexicological nature of elements from their occurrence in texts:
As in Hisamitsu's term representativeness method, the discourse size factor should be reduced, and in a more essential way;
As in Nakagawa's method, the point of observation should be limited to complex terms (or to those units which are supposed to be registered, or can be registered, in the lexicon/lexicological sphere).
A Quantitative Terminological Study
Aim: To recover the productivity of constituent elements of simplex and complex terms as heads.
Observe, like Nakagawa, the window range of simplex and complex terms in texts, e.g. / / / / /
Some preconditions/assumptions
The corpus and the target terminological space should:
belong to and represent the same domain;
cover the same period of time;
in general, match qualitatively.
We are concerned with defining a measure that can compare the productivity of elements within the same lexicological/terminological sphere.
Definition of measures (a)
f(i,N): the frequency of t_i in a text of size N.
This is the extent of use in discourse, and has nothing to do with lexicological productivity.
d(i,N): the number of different complex words whose head is t_i in a text of size N.
This is the first manifestation of lexicological productivity. It is basically identical to Nakagawa (2000), and is thus the point of departure.
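The two measures can be computed from a list of term tokens as in the following sketch. It assumes that each term occurrence is given as a tuple of elements, that the head is the final element of the term, and that f(i,N) counts every occurrence of the element whatever its position; none of these choices is stated in the slides, and the function name and the toy data are hypothetical.

```python
def use_and_productivity(terms, element):
    """terms: term tokens observed in the text, each a tuple of elements.
    Returns (f, d):
      f = number of occurrences of `element` over all term tokens,
      d = number of distinct complex terms (two or more elements)
          whose head is `element`.
    Assumption: the head is the last element of a term."""
    f = sum(term.count(element) for term in terms)
    headed = {term for term in terms if len(term) > 1 and term[-1] == element}
    return f, len(headed)

# Toy term tokens (invented data).
observed = [
    ("system",),
    ("expert", "system"),
    ("expert", "system"),
    ("operating", "system"),
    ("system", "model"),
]
f_system, d_system = use_and_productivity(observed, "system")
```

Here the element occurs five times but heads only two distinct complex terms, so f reflects use while d reflects the variety of formations.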
Definition of measures (b)
d(i,N) is the manifestation of the productivity of t_i as it occurs in the corpus.
d(i,N) is sensitive to the extent of use of the focal element in the textual corpus, e.g. the following can be the case…

             X = N    X = 2N
   d(i,X)    500      600
   d(j,X)    400      800

Here the ranking of t_i and t_j is reversed when the corpus size doubles.
Definition of measures (c)
A better measure for manifested productivity: d(i,λN), i.e. the overall transition pattern of d(i,λN), where λ takes a positive real value (à la Hisamitsu).
The measure for potential productivity: d(i) = lim_{λ→∞} d(i,λN), which discards all the quantitative factors.
It can be computed by LNRE models.
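For 0 ≤ λ ≤ 1, the transition pattern of d(i,λN) can be approximated without fitting a model, as in the sketch below. It uses binomial interpolation, which assumes that a sub-corpus of relative size λ keeps each token independently with probability λ; this is a swapped-in standard technique, not necessarily the procedure used in the presentation. The function name and the toy frequencies are hypothetical. Extrapolation to λ > 1, and the limit that defines d(i), require an LNRE model and are not shown.

```python
def expected_types(type_freqs, lam):
    """Expected number of distinct types in a sub-sample of relative size
    `lam`, given the corpus frequency of each type.
    Assumption: binomial sampling, so a type with frequency f is missed
    with probability (1 - lam) ** f. Valid for 0 <= lam <= 1 only."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation only; lam must lie in [0, 1]")
    return sum(1.0 - (1.0 - lam) ** f for f in type_freqs)

# Toy data (invented): frequencies of the complex terms headed by one
# element, e.g. one term seen 3 times and two terms seen once each.
freqs = [3, 1, 1]
curve = [(lam, expected_types(freqs, lam)) for lam in (0.0, 0.5, 1.0)]
```

At λ = 1 the function returns the observed d(i,N), and for smaller λ it traces how quickly the element accumulates new complex terms.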
The measures and prob. distributions
Three distributions:
1) The occurrence probability of heads in the theoretical lexicological space.
2) The occurrence probability of modifiers for each head.
3) The probability of use of the head in the text.
Relations…
f(i,N): distribution 3)
d(i): distribution 1)
d(i,N): distributions 2), 3)
Experiments (1/5)
Data: artificial intelligence abstracts in Japanese.
Four elements are observed: "system" and "model" (general), and "knowledge" and "information" (specific).

   #Abst    #Token Smp/Cmp      #Type Smp/Cmp
   1816     299846 / 230708     8764 / 23243
Experiments (5/5)
Ranking of the four elements (S = system, M = model, K = knowledge, I = information) by each measure:

   f(i,N)     S  K  M  I
   d(i,λN)    S  M  I  K
   d(i)       S  M  K  I

General elements, such as system or model, have high lexicological productivity, while subject-specific elements, such as knowledge or information, have rather low productivity.
Summary
Starting from the observation of ATW methods and going on to examine corpus-based quantitative terminological study, we:
clarified the position of lexicology/lexicon;
clarified the basic framework of quantitative lexicology/terminology, with relevant measures;
gave some corresponding distributions;
gave the framework of interpretation for the measures;
carried out experiments …
Remaining problems
Concepts of lexicologisation and word:
to be registered in the lexicon;
to be consolidated as a lexical unit within the syntagmatic stream of language manifestations.
Distribution of complex words in texts and the word unit:
reference+head vs. modifier+head.
The former is related to an essential concept(ualisation) of lexicon/lexicology…