
1 A CORPUS-BASED STUDY OF REFERENTIAL CHOICE: Multiplicity of factors and machine learning techniques
Andrej A. Kibrik, Grigorij B. Dobrov, Mariya V. Khudyakova, Natalia V. Loukachevitch, and Aleksandr Pechenyj
aakibrik@gmail.com

2 Referential choice in discourse
• When a speaker needs to mention (or refer to) a specific, definite referent, s/he chooses between several options, including:
  - Full noun phrase (NP):
    - Proper name (e.g. Pushkin)
    - Common noun (with or without modifiers) = definite description (e.g. the poet)
  - Reduced NP, particularly a third person pronoun (e.g. he)

3 Example
• Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done
• How is this choice made?
• Why does the speaker/writer use a certain referential option in the given context?
[on the slide, labels mark the full NPs and the pronoun, and arrows link the anaphors to their antecedent]

4 Why is this important?
• Reference is among the most basic cognitive operations performed by language users
• Reference constitutes the lion's share of all information in natural communication
• Consider text manipulation according to the method of Biber et al. 1999: 230-232

5 Referential expressions marked in green
[the Tandy passage from slide 3, with all referential expressions highlighted in green]

6 Referential expressions removed
[the same passage with all referential expressions removed]

7 Referential expressions kept
[the same passage with only the referential expressions kept]

9 Plan of talk
• I. Referential choice as a multi-factorial process
• II. The RefRhet corpus
• III. Machine learning-based approach
• IV. The probabilistic character of referential choice

10 I. MULTI-FACTORIAL CHARACTER OF REFERENTIAL CHOICE
• Multiple factors of referential choice:
  - Distance to antecedent
    - Along the linear discourse structure (Givón)
    - Along the hierarchical discourse structure (Fox)
  - Antecedent role (Centering theory)
  - Referent animacy (Dahl)
  - Protagonisthood (Grimes)
  - ...
[the slide groups these factors into properties of the discourse context and properties of the referent]

11 What shall we do with that?
• Many authors have tried to emphasize one particular factor in individual studies
• But none of these factors can explain everything: sometimes factor A is more relevant, sometimes factor B, etc.
• One must recognize the inherently multi-factorial character of referential choice
• The factors must somehow be integrated
• Previous attempts at such integration:
  - The calculative model (Kibrik 1996, 1999)
  - The neural network study (Gruening and Kibrik 2005)

12 The calculative approach
• Each value of each factor is assigned a numerical value
• For each referent occurrence, all factor values are easily identifiable, and therefore all the corresponding numerical values are readily available
• At every point in discourse the contributions of all factors are summed, giving rise to an integral characterization – the referent's activation score
• Activation score can be understood:
  - In a more cognitive way, as the referent's status with respect to the speaker's working memory
  - In a more superficial way, as a conventional integral characterization of the referent vis-à-vis referential choice
• Activation score predetermines referential choice:
  - Low → full NP
  - Medium → full or reduced NP
  - High → reduced NP
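
To make the summation concrete, here is a minimal Python sketch of a calculative model of this kind. The factor names, numeric contributions, and thresholds are invented for illustration only; they are not the values proposed in Kibrik 1996/1999.

```python
# Illustrative calculative model: factor values are summed into an activation
# score, which is then mapped onto a set of admissible referential options.
# All weights and thresholds below are assumptions made for the example.

FACTOR_WEIGHTS = {
    ("distance_to_antecedent", "short"): 0.4,
    ("distance_to_antecedent", "long"): -0.3,
    ("antecedent_role", "subject"): 0.3,
    ("antecedent_role", "other"): 0.0,
    ("animacy", "animate"): 0.2,
    ("animacy", "inanimate"): 0.0,
    ("protagonisthood", "protagonist"): 0.3,
    ("protagonisthood", "non_protagonist"): 0.0,
}

def activation_score(referent_occurrence: dict) -> float:
    """Sum the numeric contributions of all factor values for one occurrence."""
    return sum(FACTOR_WEIGHTS[(factor, value)]
               for factor, value in referent_occurrence.items())

def referential_options(score: float) -> list[str]:
    """Map an activation score onto the admissible referential options."""
    if score < 0.4:       # low activation
        return ["full NP"]
    elif score <= 0.7:    # medium activation
        return ["full NP", "pronoun"]
    else:                 # high activation
        return ["pronoun"]

occurrence = {"distance_to_antecedent": "short",
              "antecedent_role": "subject",
              "animacy": "animate",
              "protagonisthood": "protagonist"}
score = activation_score(occurrence)
print(score, referential_options(score))  # 1.2 ['pronoun']
```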

13 Multi-factorial model of referential choice (Kibrik 1999)
[diagram: various properties of the referent or discourse context; the relevant factors among them feed into the referent's activation score, which determines referential choice]

14 The neural network approach
• Neural networks:
  - A machine learning algorithm
  - Automatic selection of the factors' weights
  - Automatic reduction of the number of factors («pruning»)
• However:
  - Small data set
  - A single machine learning method
  - The interaction between factors remains covert
• Hence a new study:
  - A large corpus
  - Implementation of several machine learning methods
  - A statistical model of referential choice

15 II. THE RefRhet CORPUS
• English
• Business prose
• Initial material – the RST Discourse Treebank:
  - Annotated for hierarchical discourse structure
  - 385 articles from the Wall Street Journal
• The added component – referential annotation
• The RefRhet corpus:
  - About 30 000 referential expressions
  - 157 texts are annotated twice
  - 193 texts are annotated once
• Why this particular corpus?

16 Example of a hierarchical graph, with rhetorical distances
[figure: discourse tree with annotated distances for two anaphor-antecedent pairs: RhD = 1 vs. LinD = 4 for one pair, LinD = RhD = 2 for the other]
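
The figure contrasts linear distance (how far apart the two mentions are in the running text) with hierarchical (rhetorical) distance (how far apart they are over the discourse tree). Below is a small Python sketch of one possible operationalization, in which rhetorical distance is simplified to the number of edges on the tree path between the two elementary discourse units; the actual RefRhet definition counts steps over the rhetorical structure and treats nuclei specially, so real values are typically smaller. The toy tree and node names are assumptions.

```python
# Sketch: linear vs. hierarchical (rhetorical) distance between an anaphor's
# EDU and its antecedent's EDU. Rhetorical distance is simplified here to the
# shortest path length in the discourse tree.
from collections import deque

def linear_distance(edu_anaphor: int, edu_antecedent: int) -> int:
    """Linear distance: how many EDUs (clauses) apart the two mentions are."""
    return abs(edu_anaphor - edu_antecedent)

def rhetorical_distance(tree_edges: dict[str, list[str]],
                        edu_anaphor: str, edu_antecedent: str) -> int:
    """Shortest path length between two nodes of the discourse tree,
    given as an adjacency list (node -> neighbouring nodes)."""
    visited = {edu_anaphor: 0}
    queue = deque([edu_anaphor])
    while queue:
        node = queue.popleft()
        if node == edu_antecedent:
            return visited[node]
        for neighbour in tree_edges.get(node, []):
            if neighbour not in visited:
                visited[neighbour] = visited[node] + 1
                queue.append(neighbour)
    raise ValueError("EDUs are not connected in the discourse tree")

# Toy tree: "span_a" covers EDUs 1-2, "span_b" covers EDUs 3-4.
edges = {
    "root": ["span_a", "span_b"],
    "span_a": ["root", "edu1", "edu2"],
    "span_b": ["root", "edu3", "edu4"],
    "edu1": ["span_a"], "edu2": ["span_a"],
    "edu3": ["span_b"], "edu4": ["span_b"],
}
print(linear_distance(4, 1))                       # 3 clauses apart
print(rhetorical_distance(edges, "edu4", "edu1"))  # 4 edges through the tree
```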

17 Scheme of referential annotation
• The MMAX2 program
  - Krasavina and Chiarcos 2007
• All markables are annotated, including:
  - Referential expressions
  - Their antecedents
• Coreference relations are annotated
• Features of referents and context that can potentially be factors of referential choice are annotated

19 Work on referential annotation
• O. Krasavina
• A. Antonova
• D. Zalmanov
• A. Linnik
• M. Khudyakova
• Students of the Department of Theoretical and Applied Linguistics, MSU

20 Current state of the RefRhet referential annotation
• 2/3 completed
• Further results are based on the following data:
  - 247 texts
  - 110 thousand words
  - 26 024 markables
  - 4291 reliable «anaphor – antecedent» pairs
[chart: proper names – 43%, definite descriptions – 26%, pronouns – 31%; full NPs together – 69%]

21 Factors of referential choice (2010)
• Properties of the referent:
  - Animacy
  - Protagonisthood
• Properties of the antecedent:
  - Type of syntactic phrase (phrase_type)
  - Grammatical role (gramm_role)
  - Form of referential expression (np_form, def_np_form)
  - Whether or not it belongs to direct speech (dir_speech)

22 Factors of referential choice (2010)
• Properties of the anaphor:
  - First vs. non-first mention in discourse (referentiality)
  - Type of syntactic phrase (phrase_type)
  - Grammatical role (gramm_role)
  - Whether or not it belongs to direct speech (dir_speech)
• Distance between the anaphor and the antecedent:
  - Distance in words
  - Distance in markables
  - Linear distance in clauses
  - Hierarchical distance in elementary discourse units

23 Factors 2011
• Gender and number (agreement): masculine, feminine, neuter, plural
• Antecedent length, in words
• Number of markables from the anaphor back to the nearest full NP antecedent
• Number of the referent's mention in the referential chain
• Distance in sentences
• Distance in paragraphs
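
As a summary of the two factor inventories above, here is a sketch of how one annotated «anaphor – antecedent» pair might be represented as a feature record. Field names loosely follow the attribute names on the slides (np_form, gramm_role, dir_speech, ...), but the exact types and value inventories are assumptions, not the RefRhet annotation scheme.

```python
# Illustrative record for one "anaphor - antecedent" pair.
from dataclasses import dataclass

@dataclass
class ReferencePair:
    # properties of the referent
    animate: bool
    protagonist: bool
    # properties of the antecedent
    antecedent_phrase_type: str       # e.g. "NP", "PP"
    antecedent_gramm_role: str        # e.g. "subject", "object"
    antecedent_np_form: str           # e.g. "proper_name", "def_descr", "pronoun"
    antecedent_in_direct_speech: bool
    antecedent_length_words: int
    # properties of the anaphor
    first_mention: bool
    anaphor_phrase_type: str
    anaphor_gramm_role: str
    anaphor_in_direct_speech: bool
    agreement: str                    # "masc", "fem", "neut", "plural"
    mention_number_in_chain: int
    # distances between the anaphor and the antecedent
    dist_words: int
    dist_markables: int
    dist_clauses: int
    dist_sentences: int
    dist_paragraphs: int
    rhetorical_dist: int
    # dependent variable: form of the anaphor
    np_form: str                      # "proper_name" | "def_descr" | "pronoun"
```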

24 III. MACHINE LEARNING: TECHNIQUES AND RESULTS
• Independent variables:
  - All potential activation factors implemented in the corpus annotation
• Dependent variable:
  - Form of referential expression (np_form)
• Binary prediction:
  - Full NP vs. pronoun
• Three-way prediction:
  - Definite description vs. proper name vs. pronoun
• Accuracy maximization:
  - Ratio of correct predictions to the overall number of instances
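
A minimal sketch of how the two prediction tasks and the accuracy criterion can be derived from the annotated np_form label; the label strings are assumptions.

```python
# Deriving the binary and three-way tasks from np_form, and scoring by accuracy.
def binary_label(np_form: str) -> str:
    """Binary task: full NP (proper name or definite description) vs. pronoun."""
    return "pronoun" if np_form == "pronoun" else "full_NP"

def three_way_label(np_form: str) -> str:
    """Three-way task: definite description vs. proper name vs. pronoun."""
    return np_form

def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Ratio of correct predictions to the overall number of instances."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

print(binary_label("proper_name"))                                # full_NP
print(accuracy(["pronoun", "full_NP"], ["pronoun", "pronoun"]))   # 0.5
```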

25 Machine learning methods (Weka, a data mining system)
• Easily interpretable methods:
  - Logical algorithms:
    - Decision trees (C4.5)
    - Decision rules (JRip)
  - Logistic regression
• Quality control – the cross-validation method
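
The study used Weka (a Java toolkit); the sketch below reproduces the same evaluation setup with scikit-learn analogues, which is an assumption rather than the authors' pipeline. DecisionTreeClassifier only approximates C4.5, and scikit-learn has no JRip-style rule learner, so that method is omitted. X and y stand for the encoded feature matrix and np_form labels.

```python
# Cross-validated accuracy for the interpretable classifiers (sklearn analogues).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate(X, y, folds: int = 10) -> dict[str, float]:
    """Mean cross-validated accuracy for each classifier."""
    models = {
        "decision tree (~C4.5)": DecisionTreeClassifier(),
        "logistic regression": LogisticRegression(max_iter=1000),
    }
    return {name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
            for name, model in models.items()}
```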

26 Examples of decision rules generated by the JRip algorithm
• (Antecedent's grammatical role = subject) & (Hierarchical distance ≤ 1.5) & (Distance in words ≤ 7) => pronoun
• (Animate) & (Distance in markables ≥ 2) & (Distance in words ≤ 11) => pronoun
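
To show how such a rule list classifies an instance, here are the two rules above transcribed literally as a Python predicate; field names follow the illustrative record sketched earlier, and the fall-through default class (full NP) is an assumption.

```python
# The two JRip rules above as an ordered rule list with a default class.
def jrip_predict(pair: dict) -> str:
    if (pair["antecedent_gramm_role"] == "subject"
            and pair["rhetorical_dist"] <= 1.5
            and pair["dist_words"] <= 7):
        return "pronoun"
    if (pair["animate"]
            and pair["dist_markables"] >= 2
            and pair["dist_words"] <= 11):
        return "pronoun"
    return "full_NP"  # default class when no rule fires (assumed)
```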

27 2010 results with single machine-learning algorithms
• Accuracy:
  - Binary prediction:
    - Logistic regression – 85.6%
    - Logical algorithms – up to 84.5%
  - Three-way prediction:
    - Logistic regression – 76%
    - Logical algorithms – up to 74.3%

28 Composition of classifiers: boosting
• Base algorithm: C4.5 decision trees
• Iterative process
• Each additional classifier is applied to the objects that were not properly classified by the already constructed composition
• At each iteration the weights of the wrongly classified objects increase, so that the new classifier focuses on such objects
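
The reweighting scheme described above is the AdaBoost idea; a sketch with scikit-learn (an analogue of the Weka setup, not the authors' exact configuration) is given below. The base-learner depth and number of iterations are assumptions; requires scikit-learn >= 1.2 for the `estimator` keyword.

```python
# Boosting: decision trees trained iteratively on reweighted data.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def boosted_accuracy(X, y, folds: int = 10) -> float:
    """Cross-validated accuracy of AdaBoost over shallow decision trees."""
    model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                               n_estimators=50)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
```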

29 Composition of classifiers: bagging
• Base algorithm: C4.5 decision trees
• Bagging randomly selects a subset of the training samples to train the base algorithm
• A set of algorithms is built on different, potentially intersecting, training subsamples
• The classification decision is made through a voting procedure in which all the constructed classifiers take part
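
The corresponding sketch for bagging, again with a scikit-learn analogue of the Weka setup; the number of estimators is an assumption.

```python
# Bagging: each tree is trained on a bootstrap resample of the training data,
# and predictions are combined by majority vote.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def bagged_accuracy(X, y, folds: int = 10) -> float:
    """Cross-validated accuracy of bagged decision trees."""
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=50)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
```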

30 Binary prediction: full noun phrase vs. pronoun

Algorithm                   Accuracy 2010   Accuracy 2011
Logistic regression         85.6%           87.0%
Decision tree algorithm     84.3%           86.3%
Decision rules algorithm    84.5%           86.2%
Boosting                    –               89.9%
Bagging                     –               87.6%

31 Three-way prediction: definite description vs. proper name vs. pronoun

Algorithm                   Accuracy 2010   Accuracy 2011
Logistic regression         76.0%           77.4%
Decision tree algorithm     74.3%           76.7%
Decision rules algorithm    72.5%           75.4%
Boosting                    –               80.7%
Bagging                     –               79.5%

32 Comparison of single- and multi-factor accuracy

Feature                       Binary prediction   Three-way prediction
The largest class             69%                 43%
Distance in words             76%                 55%
Hierarchical distance         74.8%               53.5%
Anaphor's grammatical role    70%                 45.2%
Anaphor in direct speech      70%                 43.8%
Animate                       71.5%               47.3%
Combination of factors        89.9%               80.7%
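
The comparison above trains on one feature at a time versus all features together. A sketch of that procedure, under the assumption that X is an already numerically encoded feature matrix aligned with `feature_names`:

```python
# Single-feature baselines vs. the full feature combination.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def single_vs_combined(X: np.ndarray, y, feature_names: list[str],
                       folds: int = 10) -> dict[str, float]:
    """Cross-validated accuracy for each feature alone and for all features."""
    scores = {}
    for i, name in enumerate(feature_names):
        scores[name] = cross_val_score(DecisionTreeClassifier(),
                                       X[:, [i]], y, cv=folds).mean()
    scores["combination of factors"] = cross_val_score(
        DecisionTreeClassifier(), X, y, cv=folds).mean()
    return scores
```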

33 Significance of factors in the three-way prediction

Factors                                        Accuracy
All factors, including the newly added ones    80.7%
Without the anaphor's grammatical role         79.3%
Without the antecedent's grammatical role      80.2%
Without grammatical role                       79.2%
Without the antecedent's referential form      77.0%
Without protagonism                            80.0%
Without animacy                                80.68%

34 Significance of factors in the three-way task of referential choice (continued)

Factors – distances (6)                                                       Accuracy
All factors, including the newly added ones                                   80.7%
Without all distances                                                         73.5%
  – except for rhetorical distance only                                       74.9%
  – except for the distance in words only                                     79.0%
  – except for the distances in words and paragraphs                          79.5%
  – except for the distances in words and sentences                           79.7%
  – except for rhetorical distance and the distances in words and sentences   80.47%
  – except for the distances in words, markables, and paragraphs              –
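
The two tables above follow an ablation logic: drop a group of feature columns, retrain, and measure how far accuracy falls. A sketch of that loop, where the column indices per factor group are assumptions:

```python
# Feature-group ablation: accuracy of the boosted model without selected columns.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def ablation_accuracy(X: np.ndarray, y, dropped_columns: list[int],
                      folds: int = 10) -> float:
    """Cross-validated accuracy with the given feature columns removed."""
    kept = [i for i in range(X.shape[1]) if i not in dropped_columns]
    model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                               n_estimators=50)
    return cross_val_score(model, X[:, kept], y, cv=folds).mean()

# e.g. ablation_accuracy(X, y, dropped_columns=distance_columns)
```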

35 IV. REFERENTIAL CHOICE IS A PROBABILISTIC PROCESS
• According to Kibrik 1999:
  Potential referential expressions:
  - Full NP only (19%)
  - Full NP, ?pronoun (21%)
  - Pronoun or full NP (28%)
  - Pronoun, ?full NP (23%)
  - Pronoun only (9%)
  Actual referential expressions:
  - Full NP (49%)
  - Pronoun (51%)

36 Probabilistic character of referential choice in the RefRhet study
• Prediction of referential choice cannot be fully deterministic
• There is a class of instances in which referential choice is random
• It is important to tune the model so that it can process such instances in a special manner
• We are beginning to explore this problem
• Logistic regression generates an estimate of probability for each referential option
• This probability estimate can be interpreted as the activation score of the earlier model
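
A sketch of reading off those probability estimates, assuming a fitted scikit-learn LogisticRegression and a single encoded anaphor context `x` (both placeholders):

```python
# Class probabilities from logistic regression, read as an analogue of the
# activation score: a high pronoun probability ~ a highly activated referent.
import numpy as np
from sklearn.linear_model import LogisticRegression

def referential_probabilities(model: LogisticRegression,
                              x: np.ndarray) -> dict[str, float]:
    """Probability of each referential option for a single instance."""
    probs = model.predict_proba(x.reshape(1, -1))[0]
    return dict(zip(model.classes_, probs))

# e.g. {'def_descr': 0.12, 'proper_name': 0.18, 'pronoun': 0.70}
```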

37 Probabilistic multi-factorial model of referential choice
[diagram: various properties of the referent or discourse context; the relevant factors feed into the activation score, now understood as the probability of using a certain referential expression, which determines referential choice]

38 Conclusions about the RefRhet study
• Quantity: a large corpus of referential expressions
• Quality: a high level of prediction accuracy has already been attained
  - And we keep working!
• Theoretical significance: the following fundamental properties of referential choice are addressed:
  - The multi-factorial character of referential choice
  - The contribution of individual factors, assessed automatically and statistically in a variety of ways
  - The probabilistic character of referential choice
• This approach can be applied to a wide range of linguistic and other behavioral choices

39 Thank you in the CML languages
• спасибо
• благодаря
• хвала
• mulţumesc
• ευχαριστώ

40 5th International Conference on Cognitive Science
• See www.conf.cogsci.ru
• Abstract submission: between October 1 and November 15

