A Comparable Corpus Driven, Multivariate Approach to Light Verb Variations in World Chineses
Jingxia LIN (2), Menghan JIANG (1), and Chu-Ren HUANG (1)
(1) The Hong Kong Polytechnic University, (2) Nanyang Technological University

Light verbs in Chinese
Similar to English light verbs: take a rest, give advice, give a description
Semantically bleached: they contain no eventive information
The predicative content mainly comes from the complement, e.g. 進行討論 jin4xing2 tao3lun4 ‘have a discussion’
Being semantically bleached, they do not strongly select their objects
They can take a wide range of objects, including deverbal nouns, eventive nouns, and sometimes concrete nouns with eventive meaning
They are sometimes interchangeable with the same nominal object

Underspecified Selectional Restriction of Chinese Light Verbs
從事 cong2shi4, 搞 gao3, 加以 jia1yi3, 進行 jin4xing2, and 做 zuo4 are among the most frequently used (and most typical) light verbs in Modern Chinese
These five light verbs are sometimes interchangeable:
從事 / 搞 / 加以 / 進行 / 做研究 cong2shi4/gao3/jia1yi3/jin4xing2/zuo4 yan2jiu1 “to do research”

Underspecified Selectional Restriction of Chinese Light Verbs II
Collocation constraints are sometimes found with these light verbs, e.g.:
進行 / *加以 / *從事 / 搞 / *做比賽 jin4xing2/*jia1yi3/*cong2shi4/gao3/*zuo4 bi3sai4 “play a game”
*進行 / 加以 / *從事 / *搞 / *做考慮 *jin4xing2/jia1yi3/*cong2shi4/*gao3/*zuo4 kao3lv4 “give consideration”

Variations of Light Verb Usages in Mainland and Taiwan Mandarin Variants
Even with these very limited collocation constraints, variations still exist:
Taiwan light verbs tend to take more types of NPs, and even VPs, as their complements
進行感恩之旅 / 君子之爭 jin4xing2 gan3en1zhi1lv3/jun1zi3zhi1zheng1 “to proceed with a ‘thanksgiving trip’/‘gentlemen’s dispute’”
進行抹黑 / 開票 jin4xing2 mo3hei1/kai1piao4 “to proceed with ‘mud-slinging’/‘ballot counting’”
(Huang et al. 2013)

Theoretical Challenges for Corpus-based Studies of Chinese Light Verbs
Can distribution-based statistical analysis identify the differences among different Chinese light verbs?
  The contrasts among the light verbs are often tendencies rather than grammaticality dichotomies; hence the distributional patterns are less prominent and harder to characterize
Can the subtle light verb variations between different variants of Chinese be identified through statistical analysis based on comparable corpora (cf. Huang et al. 2013)?

Main Research Questions
Facing the above challenges, we try to resolve the following four research questions:
Can light verbs be differentiated from each other by statistical methods?
Can the grammatical differences between variants of the same language be empirically verified by distributional features?
Are these differences statistically significant?
If the answers to both questions are yes, how do they differ statistically from each other? That is, which is more prominent: the distributional difference between two different light verbs, or the difference between two variants of the same light verb?

Methodology
A comparable-corpus-driven statistical approach
Light verbs studied: 加以 jia1yi3, 進行 jin4xing2, 從事 cong2shi4, 搞 gao3, 做 zuo4 in Mainland Mandarin and Taiwan Mandarin
Statistical methods and tools: univariate analysis + multivariate analysis, using the Polytomous package in R (Arppe 2008)

Data
Chinese Gigaword corpus (over 1.1 billion Chinese words)
  Central News Agency (Taiwan, about 700 million characters)
  Xinhua News Agency (Mainland China, about 400 million characters)
Random sample: 200 sentences for each of the five light verbs in each of the Mainland and Taiwan corpora
  1,000 in total for Mainland Chinese
  1,000 in total for Taiwan Chinese

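
The sampling step described on this slide can be sketched as follows. This is a minimal Python illustration, not the authors' procedure (their analysis was run in R): the function name `sample_per_verb`, the fixed seed, the substring match, and the toy corpus are all assumptions made for the example. Real sampling would need word segmentation and a manual check that each hit is a light verb use.

```python
import random

LIGHT_VERBS = ["從事", "搞", "加以", "進行", "做"]

def sample_per_verb(sentences, verbs=LIGHT_VERBS, n=200, seed=0):
    """Draw up to n random sentences per verb (hypothetical helper).

    A sentence is a candidate if it contains the verb as a substring;
    this is a simplification of real light verb identification.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    samples = {}
    for verb in verbs:
        candidates = [s for s in sentences if verb in s]
        k = min(n, len(candidates))  # never ask for more than is available
        samples[verb] = rng.sample(candidates, k)
    return samples

# Toy corpus: 300 made-up sentences per verb, standing in for one news corpus.
toy_corpus = [f"句子{i}{verb}研究" for verb in LIGHT_VERBS for i in range(300)]
samples = sample_per_verb(toy_corpus)
total = sum(len(v) for v in samples.values())
print(total)
```

Running the same function once per corpus (Mainland and Taiwan) would give the two balanced samples of 1,000 sentences each.
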
12 factors (e.g. Zhu 1985, Zhou 1987, Cai 1982, Huang et al. 1995, among others), with examples and value levels

OTHERLV: co-occurs with other light verbs. Levels: yes, no
  開始進行比賽 kai1shi3/jin4xing2/bi3sai4 “start the game”
ASP: takes an aspectual marker (著, 了, 過). Levels: no, le, zhe, guo
  昨天進行了比賽 zuo2tian1/jin4xing2/le0/bi3sai4 “played the game yesterday”
EVECOMP: event complement is in subject position. Levels: yes, no
  比賽在學校進行 bi3sai4/zai4/xue2xiao4/jin4xing2 “play the game at school”
POS: part of speech of the complement. Levels: N, V
  進行比賽 (N) jin4xing2/bi3sai4 “play the game”
  進行戰鬥 (V) jin4xing2/zhan4dou4 “fight the battle”
ARGSTR: argument structure. Levels: one, two, zero
  進行調查 (two) jin4xing2/diao4cha2 “carry on investigation”
VOCOMP: VO compound as argument. Levels: yes, no
  進行投票 jin4xing2/tou2piao4 “carry on voting”
SPONTEVT: spontaneous/controllable event. Levels: yes, no
  進行投票 jin4xing2/tou2piao4 “carry on voting”
DUREVT: durative event. Levels: yes, no
  進行比賽 jin4xing2/bi3sai4 “play a game”
FOREVT: formal event. Levels: yes, no
  進行訪問 jin4xing2/fang3wen4 “pay an official visit”
PSYEVT: psychological activity. Levels: yes, no
  加以考慮 jia1yi3/kao3lv4 “give consideration”
INTEREVT: event involving interaction of agent and patient. Levels: yes, no
  進行溝通 jia1yi3/gou1tong1 inflict/communicate “do communication”
ACCOMPEVT: accomplishment complement. Levels: yes, no
  進行修正 jin4xing2/xiu1zheng4 proceed/correct “make corrections/amendments”

Mainland Chinese: an overview of the factors
> str(MLLV3)
'data.frame': 1000 obs. of 13 variables:
 $ LV       : Factor w/ 5 levels "congshi","gao",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ POS      : Factor w/ 2 levels "N","V": 2 2 2 2 1 1 2 2 2 2 ...
 $ ARGSTR   : Factor w/ 3 levels "one","two","zero": 1 1 2 1 3 3 2 1 1 1 ...
 $ VOCOMP   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ EVECOMP  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ OTHERLV  : Factor w/ 1 level "no": 1 1 1 1 1 1 1 1 1 1 ...
 $ ASP      : Factor w/ 4 levels "guo","le","no",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ SPONTEVT : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ DUREVT   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ FOREVT   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ PSYEVT   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ INTEREVT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ ACCOMPEVT: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Among the 12 independent variables, two have only one level:
OTHERLV: co-occurrence of the dependent variable (the light verb) with another light verb
  In all 1,000 sentences, none of the five light verbs co-occurs with another light verb
SPONTEVT: spontaneous events as the complement of the light verb
  In all 1,000 sentences, the five light verbs take spontaneous events as their complements
These two factors cannot distinguish the five light verbs and are therefore excluded from further statistical analysis

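
The exclusion of single-level factors can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the authors' R code: the helper `drop_constant_factors` and the three toy rows are hypothetical, and only the column names come from the slide.

```python
def drop_constant_factors(rows, outcome="LV"):
    """Split predictors into those with several observed levels and those with one.

    rows is a list of dicts, one per annotated sentence (hypothetical layout).
    """
    predictors = [name for name in rows[0] if name != outcome]
    levels = {p: {row[p] for row in rows} for p in predictors}
    kept = [p for p in predictors if len(levels[p]) > 1]
    dropped = [p for p in predictors if len(levels[p]) == 1]
    return kept, dropped

# Toy rows: OTHERLV and SPONTEVT never vary, as on the slide.
toy_rows = [
    {"LV": "congshi", "POS": "V", "OTHERLV": "no", "SPONTEVT": "yes"},
    {"LV": "gao", "POS": "N", "OTHERLV": "no", "SPONTEVT": "yes"},
    {"LV": "zuo", "POS": "V", "OTHERLV": "no", "SPONTEVT": "yes"},
]
kept, dropped = drop_constant_factors(toy_rows)
print(kept, dropped)
```

A factor with a single observed level carries no information about which light verb occurs, which is why it can be removed before model fitting.
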
Univariate analysis of Chinese light verbs
Chi-squared tests for the significance of the co-occurrence of each factor with individual light verbs
The chisq.posthoc() function in the Polytomous package automatically transforms the results (standardized Pearson residuals e_ij (Agresti 2002)) into signs:
“+”: e_ij > 2, statistically significant overuse of the light verb with the factor
“-”: e_ij < -2, statistically significant underuse of the light verb with the factor
“0”: e_ij in [-2, 2], lack of statistical significance

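
The sign coding above can be illustrated with a small stand-alone calculation. The sketch below is in Python rather than R and does not call the Polytomous package; it computes standardized Pearson residuals with the usual formula e_ij = (O_ij - E_ij) / sqrt(E_ij (1 - row_i/N)(1 - col_j/N)). The function names and the counts in `toy_table` are hypothetical.

```python
import math

def standardized_residuals(table):
    """Standardized Pearson residuals for a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            denom = math.sqrt(
                expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            )
            out.append((observed - expected) / denom)
        residuals.append(out)
    return residuals

def to_sign(e, cutoff=2.0):
    """Map a residual to '+', '-' or '0' using the slide's cutoff of 2."""
    if e > cutoff:
        return "+"
    if e < -cutoff:
        return "-"
    return "0"

# Toy counts (invented): rows are three light verbs, columns are POS = N / V.
toy_table = [[150, 50], [100, 100], [50, 150]]
signs = [[to_sign(e) for e in row] for row in standardized_residuals(toy_table)]
print(signs)
```

In this toy table the first verb is overused with N and underused with V, the second shows no significant preference, and the third shows the reverse pattern of the first.
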
Mainland Chinese – a univariate analysis
Four features show no significance (p-value > 0.05) in distinguishing the five light verbs.

Mainland Chinese – a univariate analysis
The table also shows that each light verb has a significant preference for certain factors.

Polytomous Logistic Regression
加以 / 進行 / 從事 / 搞 / 做研究 jia1yi3/jin4xing2/cong2shi4/gao3/zuo4 yan2jiu1 “to do research”
The five light verbs are the possible outcomes
Goal: estimate the probability of occurrence of each of the candidate light verbs
Polytomous logistic regression is an extension of standard logistic regression that allows simultaneous estimation of the probabilities of multiple outcomes (the light verbs in the current study)

Main Results of Polytomous for Mainland Chinese
odds > 1: the chance of occurrence of a light verb is significantly increased by the feature (marked in orange)
odds < 1: the chance of occurrence of a light verb is significantly decreased by the feature (marked in blue)
Non-significant odds (p-value > 0.05) are given in parentheses

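
To make the reading of the odds concrete, the sketch below turns per-verb odds into probabilities. It assumes a one-vs-rest setup in which each light verb has its own binary model and the resulting probabilities are renormalized to sum to 1; the slides do not say which heuristic was used, so this is an assumption. The function name and all odds values are invented for illustration and are not taken from the results table.

```python
def one_vs_rest_probabilities(intercept_odds, feature_odds, active_features):
    """Combine per-verb odds into normalized probabilities (illustrative only)."""
    raw = {}
    for verb, base in intercept_odds.items():
        odds = base
        for feat in active_features:
            # A feature with no estimate for this verb is treated as odds 1 (no effect).
            odds *= feature_odds[verb].get(feat, 1.0)
        raw[verb] = odds / (1.0 + odds)  # convert odds to a probability
    total = sum(raw.values())
    return {verb: p / total for verb, p in raw.items()}

# Hypothetical odds: the feature raises the odds of jiayi and lowers those of jinxing.
intercept_odds = {"jiayi": 0.25, "jinxing": 0.25}
feature_odds = {"jiayi": {"PSYEVTyes": 4.0}, "jinxing": {"PSYEVTyes": 0.5}}
probs = one_vs_rest_probabilities(intercept_odds, feature_odds, ["PSYEVTyes"])
print(probs)
```

With these made-up numbers, a feature with odds above 1 for one verb and below 1 for the other pushes the estimated probability clearly toward the first verb, which is the kind of contrast the colour coding on the slide highlights.
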
Distributional Contrasts Can Differentiate Light Verb Pairs
Most pairs of light verbs can be effectively differentiated by one or more factors (i.e. those for which they have contrasting positive/negative tendencies to appear):
congshi/gao: ARGSTRtwo
congshi/jiayi: ARGSTRtwo
congshi/jinxing: INTEREVTyes
gao/jiayi: ACCOMPEVTyes
gao/zuo: ARGSTRtwo/ARGSTRzero
jiayi/jinxing: ACCOMPEVTyes
jiayi/zuo: ARGSTRtwo
jinxing/zuo: INTEREVTyes
Only two pairs are without contrasting significant features:
congshi/zuo
gao/jinxing

Probability of Occurrence of Light Verbs
A probability model is adopted to predict the identity of the light verb at its position of occurrence.
The overall performance of the model is good: the most frequently predicted light verb in each column corresponds to the light verb that actually occurs in the data (see the red figures).

F-score of Automatic Identification of Five Light Verbs Based on Mainland Mandarin Data
Light verb  Recall  Precision  F-score
congshi     0.655   0.4645     0.5436
gao         0.08    0.5        0.1379
jiayi       0.96    0.4455     0.6086
jinxing     0.31    0.6966     0.4291
zuo         0.485   0.5843     0.5300

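
The F-scores in the Mainland table are consistent with the balanced F-score, F = 2PR / (P + R). The Python sketch below recomputes them from the reported recall and precision; the function name `f_score` is ours, and because the inputs are already rounded, the last digit of a recomputed value may differ slightly from the reported one.

```python
def f_score(recall, precision):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (recall, precision, reported F-score) copied from the Mainland table.
mainland = {
    "congshi": (0.655, 0.4645, 0.5436),
    "gao": (0.08, 0.5, 0.1379),
    "jiayi": (0.96, 0.4455, 0.6086),
    "jinxing": (0.31, 0.6966, 0.4291),
    "zuo": (0.485, 0.5843, 0.5300),
}

for verb, (recall, precision, reported) in mainland.items():
    print(verb, round(f_score(recall, precision), 4), reported)
```

The recomputed values make the trade-off visible: jiayi combines very high recall with modest precision, while gao has reasonable precision but is almost never predicted, which pulls its F-score below the chance level of 0.2.
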
Analysis of Outcome (ML)
Each light verb can be successfully identified with a better F-score than chance (0.2), with the exception of 搞 gao3, although the performance varies from light verb to light verb:
加以 jia1yi3 > 從事 cong2shi4 / 做 zuo4 > 進行 jin4xing2 > 搞 gao3
加以 jia1yi3 is the only light verb with effective differentiating factors against all other light verbs. All four of its significant factors are positive (i.e. direct evidence for its occurrence).
從事 cong2shi4 / 做 zuo4: both have only one type of significant factor, and these are negative (i.e. indirect evidence).
搞 gao3 and 進行 jin4xing2 have both positive and negative factors, which may have cancelled each other out. The significance of their factors is also relatively weak.
Note that the low F-score of 搞 gao3 is consistent with the linguistic observation that it is rarely used as a light verb in ML.

F-score of Automatic Identification of Five Light Verbs Based on Taiwan Mandarin Data
Light verb  Recall  Precision  F-score
congshi     0.32    0.5614     0.4076
gao         0.695   0.5036     0.5840
jiayi       0.95    0.4139     0.5766
jinxing     0.335   0.5929     0.4281
zuo         0.16    0.8421     0.2689

Analysis of Outcome (TW)
Each light verb can be successfully identified with a better F-score than chance (0.2), but the performance varies from light verb to light verb:
搞 gao3 / 加以 jia1yi3 > 進行 jin4xing2 / 從事 cong2shi4 > 做 zuo4
搞 gao3 and 加以 jia1yi3 each have only positive significant factors (i.e. direct evidence for their occurrence).
從事 cong2shi4 has only negative significant factors (i.e. indirect evidence).
進行 jin4xing2 has more positive than negative significant factors.
做 zuo4 has both types of significant factors, but negative ones outnumber positive ones.
Linguistically,

Comparison of Mainland and Taiwan light verbs: univariate analysis
Key results:
ML and TW 做 zuo4 show opposite usage tendencies for the feature ARGSTR.two
ML and TW 進行 jin4xing2 show opposite usage tendencies for the features ASP.le and ASP.no
But the difference is between a significant and a non-significant feature, rather than between a significant positive and a significant negative feature

Probability estimates of Mainland and Taiwan light verbs by Polytomous
In both ML and TW, the model performs well overall: the most frequently predicted light verb in each column corresponds to the light verb that actually occurs in the data (see the red figures).
The results also show that, while one light verb has the highest probability in a particular context (a set of factors), other light verbs may also have a chance to occur. This is the reason why, empirically, more than one light verb can occur in the same context.

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns. They differ only in that TW is less likely to take formal events as arguments (FOREVTyes). This is consistent with the intuition that jinxing will be preferred in this context in TW.
Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns.
Both ML and TW 搞 gao3 are significantly favored by
ML 搞 gao3 is less likely to occur with accomplishment objects. This, and the fact that it is unlikely to occur with the aggregate of default variable values, suggests that it is unlikely to be used as a light verb in ML.

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns.
ML 加以 jia1yi3 is more likely to occur with two arguments (ARGSTRtwo), as well as to take VO compounds or psychological events as objects (VOCOMPyes and PSYEVTyes). This confirms the intuition that it is more frequently used in ML.

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns.
ML 進行 jin4xing2 is not likely to take accomplishment objects (ACCOMPEVTyes), while TW 進行 jin4xing2 is very likely to take VO compound objects (VOCOMPyes), consistent with Huang et al. (2013).

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns.
Their distributional patterns are consistent with the analysis of zuo4 as the most bleached of the Mandarin light verbs. (The attachment of the perfect aspect marker -le is known to be a shared grammatical potential of all light verbs.)

Conclusion
This study compares the usage tendencies of Chinese light verbs
  (1) among five different light verbs
  (2) between Mainland and Taiwan Mandarin usage of the same light verb
The comparable-corpus-driven statistical analysis is able to generalize about the similarities and differences among light verbs with different factors
  The contrast between different light verb pairs can be anchored by statistically significant positive vs. statistically significant negative pairs
  The difference between two Chinese varieties for the same light verb, however, is between statistically significant vs. non-significant pairs
The above result allows us to hypothesize that
  different light verbs, even with their weak selectional features, can be identified and differentiated by contrasting distributional tendencies
  variants of the same language, however, do not show contrasting tendencies but can be differentiated by the existence (i.e. significant vs. non-significant) of some distributional tendencies

References
Arppe, A. (2008). Univariate, bivariate and multivariate methods in corpus-based lexicography – a study of synonymy. Publications of the Department of General Linguistics, University of Helsinki, No. 44. URN: http://urn.fi/URN:ISBN:978-952-10-5175-3.
Arppe, A. (2009). Linguistic choices vs. probabilities – how much and what can linguistic theory explain? In: Featherston, S. & S. Winkler (eds.), The Fruits of Empirical Linguistics. Volume 1: Process. Berlin: de Gruyter, pp. 1–24.
Arppe, A. (in prep.). Solutions for fixed and mixed effects modeling of polytomous outcome settings.
Han, Weifeng, Arppe, Antti & Newman, John. (2013). Topic marking in a Shanghainese corpus: from observation to prediction. Corpus Linguistics and Linguistic Theory (preprint).
Butt, M., & Geuder, W. (2001). On the (semi) lexical status of light verbs. Semi-lexical Categories, 323-370.
Cattell, R. (1984). Composite Predicates in English. Syntax and Semantics Volume 17. Sydney: Academic Press Australia.
Cai, Wenlan. (1982). Issues on the Complement of ‘jinxing’ (“進行”帶賓問題). Chinese Language Learning (漢語學習), (3), 7-11.

References
Huang, Chu-Ren and Jingxia Lin. (2013). The ordering of Mandarin Chinese light verbs. Proceedings of the 13th Chinese Lexical Semantics Workshop. D. Ji and G. Xiao (Eds.): CLSW 2012, LNAI 7717, pp. 728-735. Heidelberg: Springer.
Huang, Chu-Ren, Jingxia Lin, and Huarui Zhang. (2013). World Chineses based on comparable corpus: The case of grammatical variations of jinxing. 《澳门语言文化研究》, 397-414.
Jespersen, O. (1965). A Modern English Grammar on Historical Principles. Part VI, Morphology. London: George Allen and Unwin Ltd.
Zhou, Gang. (1987a). Subdivision of Dummy Verbs (形式動詞的次分類). Chinese Language Learning (漢語學習), 1, 11-14.
Zhou, Xiaobing. (1987b). Sentence Pattern Comparison of ‘jinxing’ and ‘jiayi’ (“進行”“加以”句型比較). Chinese Language Learning (漢語學習), 6, 1-5.
Zhu, Dexi. (1985). Dummy Verbs and NV in Modern Chinese (現代書面漢語里的虛化動詞和名動詞). Journal of Peking University (Humanities and Social Sciences) (北京大學學報(哲學社會科學版)), 5, 1-6.