Published by Amber Perkins. Modified over 11 years ago.
1
Direkt Profil: an automatic analyzer of texts written in French as a second language Jonas Granfeldt(1), Pierre Nugues(2), Suzanne Schlyter(1), Malin Ågren (1), Edin Kukovic (1), Emil Persson (1), Jonas Thulin (2), Lisa Persson (2), Fabian Kostadinov (3) (1) Lund University, Centre for languages and literature, French (2) Lund Institute of Technology, Department of Computer Science (3) University of Zürich, Department of Computer Science http://profil.sol.lu.se Jonas.Granfeldt@rom.lu.se
2
OUTLINE Introduction –The idea –Rationale –The knowledge bases –Demo Theoretical background –Developmental sequences and developmental stages in L2 French Method –CEFLE - The development corpus –The Direkt Profil system Overview of the system Annotation Defining profiles/stages with machine learning Results Annotation Defining profiles/stages Example of an applied study with Direkt Profil –Direkt Profil and teachers' assessments: a correlation study Conclusion –Problems –Future work
3
The idea was… –To provide researchers, teachers and learners with an easy-to-use tool for overall diagnostic assessment of developmental stage. –To base the assessment on current research on second language acquisition. –To automatically provide feedback to teachers and learners on language level and central target features of the language. –To use learners' free written production as the basis of assessment (rather than cloze tests). INTRODUCTION
4
Rationale Language acquisition is a process which follows a specific and definable order. Learners and teachers want to know about the progress the learners make. Instruction is probably most effective if it is adapted to the learner's present developmental level (cf. the Teachability Hypothesis, Pienemann, 1985). INTRODUCTION
5
The knowledge bases for the project Second Language Research Linguistics (French) Natural Language Processing Engineering INTRODUCTION
6
An example: a learner text from the corpus C'est deux personne, une fille et sa mère. La fille est grand et elle a une robe blue. Sa mère est petite mais grosse et elle a une robe vert. Elles va à L'Italie dans ses vacances. La fille pense à les garcons italien et sa mere pense du soleil. Elles sont derière un table avec une map. Elles boire des café. Leur voiture est vert. La voiture est trés petite est la bagage n'est pas fit. Maintenant elles à destination D'Italie. Elles check in. Le monsiuer fait une ronde tête est une grand moustache. Leur chambre est beau avec deux lis est une trés beaux vue. Elle est sur la plage. Sur la mere il y a des bateaux. Elles fait du soleil. Dans la soir elle a dîner dans une restaurant. À côté il y a un garcon avec une costume blue. Après le diner elles boire du vin rouge dans la bar. Les deux garcon d'italien ils voir la mère et sa fille. Ils sont d'amour. Ils parlent et boire de alcohol. Aprés ils fait du dancing. Le jour aprés ils fait du sightseeing avec Tony et son autobus rouge. Il est bold. Après le sightseeing ils visite un marche. La dame grosse a une hat rouge. Le monsieur grand a un hat noir. La fille grand amour le garcon petite mais grosse. Sur le soir ils separé - le grand monsieur avec la petite mais grosse dame et la grand fille avec le petite mais trés grand monsieur. Le jour après ils revenir a Suede avec les deux monsieurs. INTRODUCTION
7
DEMO HERE INTRODUCTION
8
French L2 in a developmental perspective Many projects since the 1980s (examples) –ESF project (Perdue, 1993, L2 French, different L1s) –InterFra project (Bartning, 1997 and later) (Swedish L1) –FIFI/DURS project (Schlyter, 1986 and later, Granfeldt, 2003) (Swedish L1) –Myles & Mitchell Myles (2002 and later) (FLLOC project, English L1) Empirical objectives of this research: –to arrive at rich and empirically valid descriptions of how French interlanguage develops over time. –to identify features at different linguistic levels which are developmentally related. Some syntheses are emerging: –Bartning & Schlyter (2004): A proposal of six stages of development. –Véronique et al. (2009): A proposal of three stages. THEORETICAL BACKGROUND
9
Benchmarking grammatical development of French L2 (Bartning & Schlyter, 2004) Objectives: Describe developmental sequences in French L2 for a number of morphosyntactic phenomena. Establish general learner stages/profiles with respect to grammatical development. Data: Oral corpora of French L2 (L1 = Swedish). Post-puberty learners (N=35, 80 recordings). Method: Frequency analysis and linguistic profiling. Manual and semi-automated tagging of transcriptions. THEORETICAL BACKGROUND
10
A model with 6 profiles/stages (sample). Granfeldt (2003); Bartning & Schlyter (2004). [Table not recovered; column groups: Initial, Intermediate, Advanced]
11
Direkt Profil Objectives: To implement the model of Bartning & Schlyter (2004). To develop an easy-to-use system for automated annotation, extraction and frequency analysis of as many as possible of the features in B&S's work. To develop a system for defining developmental stages/profiles. Method: Constructing an interlanguage partial parser for L2 French. Connecting the parser to a module for machine learning. Constructing an interface. We have expanded on B&S's original work with respect to: Type of data (written rather than oral). Quantity of data. Additional features (more morphosyntactic features, lexical and quantitative features). METHOD
12
Overview of Direkt Profil
13
The development corpus CEFLE CEFLE: Corpus Ecrit de Français Langue Etrangère 400 texts written under controlled conditions by 85 Swedish and 22 French students (317 texts used here) 4 texts / learner. Manual assignment of stage to one text from each learner using B&S criteria (Voyage en Italie) Granfeldt, Nugues et al. (2006)
14
ANNOTATION We developed an annotation scheme based on the B&S (2004) framework. The concept of a noun or verb group is the grammatical representation of most phenomena in this framework, and is essential to the Direkt Profil annotation. Many syntactic annotation frameworks for French take this into consideration –An example from Gendner et al. (2004): et mademoiselle qui appelait au secours !... ou plutôt non, on ne l' entendait plus... elle était peut-être morte... This annotation makes no provision, however, for the specific details of the B&S framework.
15
ANNOTATION (contd) The Direkt Profil annotation is an XML-based markup, split into 5 levels: 1. Tokenisation 2. Identification of prefabricated structures (c'est, je m'appelle, etc.)
16
ANNOTATION (contd) 3. POS-tagging (Det, Prep, Pron, V(être/avoir), Konj) 4. Group detection/chunking: rule-based (decision tree), using a set of grammatical words (« mots vides », Tesnière, 1959; Vergne, 1998) 5. Chunk classification: rule-based feature checking between elements.
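To make levels 1, 2 and 4 concrete, here is a minimal Python sketch. It is not the Direkt Profil implementation: the word lists, the function names (tokenise, mark_prefabs, chunk) and the single chunking rule are all assumptions chosen for illustration, and the real system relies on a full dictionary, a POS-tagging step and a larger rule set.

```python
import re

# Hypothetical, tiny resources; the real system uses a full dictionary.
PREFABS = [("c'est",), ("je", "m'appelle"), ("il", "y", "a")]
FUNCTION_WORDS = {"le", "la", "les", "un", "une", "des", "dans", "à", "de",
                  "et", "je", "il", "ils", "elle", "elles"}

def tokenise(text):
    """Level 1: lower-case and split into word tokens (straight apostrophes stay inside a token)."""
    return re.findall(r"[\w']+", text.lower())

def mark_prefabs(tokens):
    """Level 2: group known prefabricated sequences into single units."""
    units, i = [], 0
    while i < len(tokens):
        for prefab in PREFABS:
            if tuple(tokens[i:i + len(prefab)]) == prefab:
                units.append((" ".join(prefab), "PREFAB"))
                i += len(prefab)
                break
        else:  # no prefab starts at position i
            units.append((tokens[i], "TOKEN"))
            i += 1
    return units

def chunk(units):
    """Level 4, much simplified: a function word or a prefab opens a new group."""
    groups, previous_was_prefab = [], True  # True forces the first unit to open a group
    for form, kind in units:
        if kind == "PREFAB" or previous_was_prefab or form in FUNCTION_WORDS:
            groups.append([form])
        else:
            groups[-1].append(form)
        previous_was_prefab = (kind == "PREFAB")
    return groups

print(chunk(mark_prefabs(tokenise("Ils parlons dans la bar"))))
```

Using grammatical words as group boundaries follows the « mots vides » idea cited on the slide. Deciding whether a group is a noun or verb group, and checking agreement inside it, would be the job of levels 3 and 5, which this sketch leaves out.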
17
The sentence "Ils parlons dans la bar" is annotated as: Ils parlons dans la bar. c5148 reads: Lexical verb/Present tense/3rd.pers.PL/no_agreement. c3071 reads: Det_Noun_NP/singular_det/without_gender_agreement. –Features are finally counted and raw occurrences are converted to percentages (where relevant).
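The final counting step can be sketched as follows. This is an illustration only: the input format (one pair of chunk type and agreement flag per annotated chunk) and the function name feature_percentages are assumptions, not the system's actual data structures.

```python
from collections import Counter

def feature_percentages(chunk_tags):
    """Turn raw per-chunk annotations into agreement percentages per chunk type.

    chunk_tags: list of (chunk_type, agrees) pairs, one per annotated chunk.
    """
    totals, agreeing = Counter(), Counter()
    for chunk_type, agrees in chunk_tags:
        totals[chunk_type] += 1
        if agrees:
            agreeing[chunk_type] += 1
    return {t: 100.0 * agreeing[t] / totals[t] for t in totals}

# Invented annotations for one short text.
tags = [("Det_Noun_NP", False), ("Det_Noun_NP", True),
        ("Det_Noun_NP", True), ("Det_Noun_NP", True),
        ("Lexical_verb", False), ("Lexical_verb", True)]
print(feature_percentages(tags))
```

Percentages rather than raw counts keep texts of different lengths comparable, which matters when the values are later used as features for classification.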
18
The dictionary The engine uses a dictionary of French inflected forms freely available from the Association des Bibliophiles Universels (ABU). We have corrected and complemented it, and converted it to XML. We have also added frequency-of-use information from the Lexique database (New, Pallier & Ferrand, 2005).
19
DEFINING STAGES/PROFILES Using the criteria in Bartning & Schlyter (2004), two researchers manually classified 82 texts of the sub-corpus Le voyage en Italie (part of CEFLE). The classification was subsequently re-used for all texts from the same learner, resulting in 317 classified texts. We trained classifiers using automatically extracted phenomena as features representing the learners' texts. Currently 142 phenomena (features/attributes) are used when establishing a learner profile stage. We used C4.5 (Quinlan, 1986), LMT (Landwehr et al., 2003), and Support Vector Machines (Boser et al., 1992) from the Weka collection (Witten & Frank, 2005).
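The re-use of one manual classification per learner can be pictured with a small Python sketch. The identifiers and the function name propagate_stages are hypothetical; the sketch only shows the bookkeeping, under the slide's assumption that a learner's stage holds for all of that learner's texts.

```python
def propagate_stages(manual_stage_by_learner, texts):
    """Give every text the stage manually assigned to its author.

    manual_stage_by_learner: {learner_id: stage} from the manually classified texts.
    texts: list of (text_id, learner_id) pairs for the whole corpus.
    Texts by learners without a manual stage are left out.
    """
    return {text_id: manual_stage_by_learner[learner_id]
            for text_id, learner_id in texts
            if learner_id in manual_stage_by_learner}

# Invented identifiers for illustration.
manual = {"learner_A": 2, "learner_B": 4}
corpus = [("A_italie", "learner_A"), ("A_souvenir", "learner_A"),
          ("B_italie", "learner_B"), ("C_italie", "learner_C")]
print(propagate_stages(manual, corpus))
```

Each labelled text, paired with its automatically extracted feature values, then becomes one training instance for the classifiers.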
20
RESULTS
21
Annotation Granfeldt, Nugues et al., 2005 RESULTS
22
CLASSIFICATION using all features Granfeldt & Nugues, 2007 RESULTS
23
A sample decision tree
% NPs with gender agreement <= 93
|   % nominative pronouns <= 4: 1 (7.0/1.0)
|   % nominative pronouns > 4
|   |   % NPs with num+gen agreement <= 94: 1 (2.0)
|   |   % NPs with num+gen agreement > 94
|   |   |   % pluperfect verbs in S-V agreement <= 0
|   |   |   |   S-V agreement w/ modal verbs <= 10
|   |   |   |   |   Average sentence length <= 15
|   |   |   |   |   |   % of the next 2,000 words <= 0: 1 (2.0/1.0)
|   |   |   |   |   |   % of the next 2,000 words > 0
|   |   |   |   |   |   |   % D-N-A in agreement <= 0: 2 (11.0)
|   |   |   |   |   |   |   % D-N-A in agreement > 0
|   |   |   |   |   |   |   |   % D-A-N in agreement <= 50
|   |   |   |   |   |   |   |   |   % of the next 2,000 words <= 1: 2 (8.0/1.0)
|   |   |   |   |   |   |   |   |   % of the next 2,000 words > 1
|   |   |   |   |   |   |   |   |   |   % prepositions <= 9
|   |   |   |   |   |   |   |   |   |   |   % vbs in the imperfect <= 0
|   |   |   |   |   |   |   |   |   |   |   |   % mod+inf verbs in S-V agreement <= 33: 2 (4.0)
|   |   |   |   |   |   |   |   |   |   |   |   % mod+inf verbs in S-V agreement > 33: 3 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   % vbs in the imperfect > 0: 3 (2.0)
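Reading the leaves: in Weka's J48 output, a leaf such as "1 (7.0/1.0)" gives the predicted stage, then the number of training texts that reach the leaf and, after the slash, how many of those are misclassified. The Python sketch below pulls these numbers out of a single tree line. The function name parse_leaf and the regular expression are illustrative assumptions, not part of Weka.

```python
import re

# ": <class> (<covered>[/<errors>])" at the end of a line marks a leaf.
LEAF = re.compile(r":\s*(\S+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")

def parse_leaf(line):
    """Return (stage, texts_covered, texts_misclassified), or None for an inner node."""
    match = LEAF.search(line)
    if match is None:
        return None
    stage = match.group(1)
    covered = float(match.group(2))
    errors = float(match.group(3) or 0.0)  # no slash means no misclassified texts
    return stage, covered, errors

print(parse_leaf("% nominative pronouns <= 4: 1 (7.0/1.0)"))
print(parse_leaf("% nominative pronouns > 4"))
```

Leaves that cover only two or three texts, as several do in this sample, rest on very little evidence, which is one reason to treat the deeper branches with caution.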
24
Attribute selection We ran an attribute selection procedure in order to identify the best features at this point. To evaluate the 142 attributes, we measured the information gain of each attribute with respect to the class. This method is derived from ID3 and is part of the Weka software. Top 10 features according to the InfoGain metric (Granfeldt & Nugues, 2007):
Average merit   Feature
0.4371          % Determiner Noun agreement (gender errors)
0.3351          % Unknown words (i.e. not in dictionary)
0.3232          % NPs with gender agreement (including adjectives)
0.2925          Average sentence length
0.2565          % Prepositions (out of all parts-of-speech)
0.2082          % S-V agreement with modal verbs followed by infinitive
0.1953          % Noun Adjective with agreement (gender and number)
0.1793          % S-V agreement with auxiliary in passé composé
0.1739          % S-V agreement with être/avoir 3ppl (all tenses)
0.153           % K1Tokens (out of all tokens)
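Information gain measures how much the entropy of the class (here, the stage) drops once the value of an attribute is known. The Python sketch below computes it for one numeric attribute split at a fixed threshold. That fixed threshold is a simplifying assumption, since Weka discretises numeric attributes on its own, and the data and function names are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, threshold):
    """Entropy of the labels minus the weighted entropy after splitting at threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

# Invented example: % NPs with gender agreement for four texts, with their stages.
gender_agreement = [60, 70, 95, 98]
stages = [1, 1, 3, 3]
print(info_gain(gender_agreement, stages, threshold=93))
```

An attribute that separates the stages perfectly recovers the full class entropy, while one that leaves the class mix unchanged on both sides of the split scores zero. Ranking attributes by this score treats each attribute in isolation, so redundant attributes can both rank highly.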
25
Results after feature selection (top 20 attributes) Granfeldt & Nugues, 2007
26
Direkt Profil and teachers' assessment: a correlation study An example of an applied study with Direkt Profil. Several scholars have suggested that work on developmental sequences and stages could be used as a means of assessing the language development of a particular individual at a given time (Clahsen, 1985; Pienemann & Johnston, 1987; the Rapid Profile program, Pienemann & Mackay, 1992; Brindley, 1998).
27
Research questions 1. What is the correlation between the developmental stage and teachers' assessments of the same texts? (RQ1) 2. To what extent can the developmental stage predict teachers' ranking of a particular text? (RQ2)
28
Method –50 texts from the CEFLE corpus (Ågren, 2005) were selected (Task: Le voyage en Italie picture series) –The learner texts had previously been manually analysed according to developmental stage following the criteria in B&S Stage 1 (man) Stage 2 (man) Stage 3 (man) Stage 4 (man) Natives 10 texts The texts were also analysed by Direkt Profil, resulting in two separate indications of developmental stage (manual and automated)
29
Method (contd) 7 experienced upper secondary school teachers rated the 50 texts on a six-point scale (6 = highest level). They were asked to assess the texts in three domains: (a) Form, i.e. language (grammar, lexicon, spelling etc.) (b) Content and Communication (content in relation to the pictures, the communicative success of the text) (c) Overall, i.e. combining a and b (in a way they found suitable). For each assessment, the teachers also stated the degree of certainty with which they had rated the text (five-point scale, where 5 indicated completely certain and 1 completely uncertain).
30
RESULT: Median and distribution of ratings for form (language) Granfeldt & Ågren, 2009
31
RESULT Inter-rater agreement between teachers Granfeldt & Ågren, 2009
32
RESULT: Correlating developmental stage and teachers' assessments Answering Research Question 1: The developmental stage is better correlated with the assessments of the teachers than instructional level. Granfeldt & Ågren, 2009
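The slide does not say which correlation coefficient was used. Since both stages and ratings are ordinal, a rank correlation is one natural choice, and the Python sketch below computes Spearman's rho with average ranks for ties. Both the choice of Spearman and the toy data are assumptions made for illustration.

```python
def average_ranks(xs):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Invented data: developmental stage and a teacher's rating for six texts.
stage = [1, 1, 2, 3, 4, 4]
rating = [1, 2, 2, 4, 5, 6]
print(round(spearman(stage, rating), 3))
```

Running the same function with instructional level in place of stage would give the second coefficient needed for the comparison made on this slide.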
33
RESULT: Regression analysis Answering Research Question 2: Approx. 70% of the variance in the teachers' ranking of the texts can be explained by the developmental stage as analysed by Direkt Profil.
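"Variance explained" is the coefficient of determination R² of the regression. The Python sketch below fits a least-squares line with a single predictor and computes R². The slide does not describe the exact regression model used in the study, so the single-predictor setup, the function name r_squared and the data are assumptions for illustration.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line y = a + b*x fitted to the data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Invented data: stage assigned to a text and the teachers' rating of it.
stage = [1, 2, 3, 4]
rating = [1, 3, 2, 4]
print(round(r_squared(stage, rating), 2))
```

An R² of about 0.70, as reported here, would mean that roughly 30% of the variation in the teachers' rankings is left unaccounted for by the stage alone.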
34
Conclusion We have presented a system for assessment of developmental stage/profile in French as a second language. –The system implements the current theory of stages/profiles of development in French. The system consists of –an interlanguage partial parser for French L2 called Direkt Profil and –a machine-learning module connected to it. Results: –An evaluation of the annotation showed mixed results, depending very much on the developmental stage of the writer. –Results from classification experiments show: Best results with a 3-stage classification: a mean F of 0.82. Stage 1 is the most problematic. The texts from the natives are relatively easy to classify: a mean F of 0.91. A large feature set does not seem to be necessary (at least not for this data). Using an attribute/feature selection method, we have identified a list of the 10 best attributes.
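For readers unfamiliar with the F values quoted above: the F-measure of a class is the harmonic mean of precision and recall for that class. The Python sketch below computes it per stage and then takes an unweighted mean over the stages. Whether the reported mean F is unweighted or weighted by class size is not stated on the slide, so the unweighted mean is an assumption, and the labels are invented.

```python
def f_measures(gold, predicted):
    """Per-class F-measure (harmonic mean of precision and recall)."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores

# Invented manual stages and classifier output for six texts.
gold = [1, 1, 2, 2, 3, 3]
predicted = [1, 2, 2, 3, 3, 3]
per_stage = f_measures(gold, predicted)
mean_f = sum(per_stage.values()) / len(per_stage)
print(per_stage, round(mean_f, 3))
```

Reporting F per stage alongside the mean makes it visible when one stage is much harder than the others, as the slide reports for stage 1.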
35
Problems Briefly, the language produced by learners is about the worst imaginable type of language for NLP. (Tschichold, 2007) –Lexical spelling (orthographe lexicale) is a problem: incorrect forms lead to increased ambiguity and to incorrect annotation. –Attribute selection has not been sufficiently studied. –The amount of data is still insufficient.
36
Future work Optimising annotation: Procedures to address the spelling problem. Review the rules. Ongoing student tests with a stochastic parser (trained on the Le Monde corpus). Adding more texts from higher stages of development. Expanding to other languages (Italian L2). Continue working with other assessment schemes, i.e. the Common European Framework of Reference (Granfeldt, 2008).
37
Thank you for your attention! Direkt Profil is free to use. Available at this address: –http://profil.sol.lu.se Acknowledgments The profiling team in Lund: Pierre Nugues, Suzanne Schlyter, Malin Ågren, Edin Kuckovic, Emil Persson, Fabian Kostadinov, Lisa Persson. This work was supported by the Swedish Research Council, grant number 2004-1674.