Sentence Classification and Clause Detection for Croatian Kristina Vučković, Željko Agić, Marko Tadić Department of Information Sciences, Department of.


1 Sentence Classification and Clause Detection for Croatian. Kristina Vučković, Željko Agić, Marko Tadić. Department of Information Sciences, Department of Linguistics, Faculty of Humanities and Social Sciences, University of Zagreb. {kvuckovi, zagic, marko.tadic}@ffzg.hr. FASSBL 7 Conference, Dubrovnik, Croatia, 2010-10-05

2 Overview. What? Classifying Croatian sentences by structure; detecting independent and dependent clauses. How? Implemented a prototype system in NooJ; linked it with a morphosyntactic tagger; evaluated on a sample from Croatian corpora. Why? Rule-based chunking and shallow parsing.

3 Classification and detection. Sentence segmentation is easy when considering sentence boundaries only. Here, we detect the boundaries of clauses in complex sentences and assign a type to each sentence. Sentence classification: by purpose (declarative, interrogative, etc.) and by structure (simple and complex). Complex sentences: independent complex (i.e. compound) sentences and dependent complex sentences.

4 Classification and detection. Independent complex sentences: an independent clause is connected to the main clause by a conjunction, and its type is defined by the choice of conjunction, e.g. constituent clause, conjunctions {i, pa, te, ni, niti}; also disjunctive, opposite, exclusive, conclusive and explanatory clauses. Example: Svi su spavali, jedino sam ja bio budan. ("Everyone was asleep; only I was awake.") (exclusive). Dependent complex sentences: the main clause is independent; all the others depend on it and cannot stand alone as sentences. Types: predicative, subjective, objective, attributive, appositional and adverbial clauses. Example: Ispričat ću ti što mi se dogodilo. ("I will tell you what happened to me.") (objective).
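The idea that the conjunction determines the type of an independent clause can be illustrated with a small lookup. This is only a sketch, not the NooJ grammar used by the authors: the function name and the table are hypothetical, the table contains only the conjunctions named on the slide, and treating "jedino" as an exclusive conjunction is inferred from the slide's example sentence.

```python
# Conjunction-to-type table, limited to what the slide mentions.
# Disjunctive, opposite, conclusive and explanatory types are left out
# because the slide does not list their conjunctions.
CONJUNCTION_TYPES = {
    "i": "constituent",
    "pa": "constituent",
    "te": "constituent",
    "ni": "constituent",
    "niti": "constituent",
    "jedino": "exclusive",  # assumption: inferred from the slide's example
}


def classify_independent_clause(clause_tokens):
    """Return the clause type implied by the clause-initial conjunction.

    Returns None when the clause is empty or its first token is not in
    the (deliberately incomplete) table above.
    """
    if not clause_tokens:
        return None
    return CONJUNCTION_TYPES.get(clause_tokens[0].lower())


if __name__ == "__main__":
    # Second clause of "Svi su spavali, jedino sam ja bio budan."
    print(classify_independent_clause(["jedino", "sam", "ja", "bio", "budan"]))
```

A real grammar would also need to handle multi-word conjunctions and conjunctions that are not clause-initial, which this lookup ignores.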

5 The system. Prototype implemented in NooJ: finite-state transducer cascades (local grammars) and Croatian lexical resources. Each cascade detects and annotates a different type of clause. Built on top of a chunker for Croatian. The top-level grammar has two types of subgraphs: main clauses and independent clauses.

6 The system. Main clause grammar: presence of a VP and possibly any other phrase. Independent clauses are recognized just by using the conjunctions. The implementation of dependent clause detection varies across clause types.
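The two rules on this slide (a main clause needs a VP; an independent clause is recognized by its conjunction) can be approximated over chunker output. The sketch below is in Python rather than NooJ and is not the authors' grammar: the chunk labels ("NP", "VP", "CONJ"), the function names, and the chunking of the example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple

# A chunk is a (label, text) pair; the labels are hypothetical stand-ins
# for whatever the underlying Croatian chunker produces.
Chunk = Tuple[str, str]


def split_clauses(chunks: List[Chunk]) -> List[List[Chunk]]:
    """Start a new clause at every conjunction chunk."""
    clauses: List[List[Chunk]] = []
    current: List[Chunk] = []
    for label, text in chunks:
        if label == "CONJ" and current:
            clauses.append(current)
            current = []
        current.append((label, text))
    if current:
        clauses.append(current)
    return clauses


def detect_clauses(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Label each clause that contains a VP as 'main' or 'independent'."""
    results = []
    for clause in split_clauses(chunks):
        if not any(label == "VP" for label, _ in clause):
            continue  # no VP: not accepted as a clause by this sketch
        kind = "independent" if clause[0][0] == "CONJ" else "main"
        results.append((kind, " ".join(text for _, text in clause)))
    return results


if __name__ == "__main__":
    # Hand-made chunking of "Svi su spavali, jedino sam ja bio budan."
    example = [
        ("NP", "Svi"), ("VP", "su spavali"),
        ("CONJ", "jedino"), ("VP", "sam"), ("NP", "ja"), ("VP", "bio budan"),
    ]
    for kind, text in detect_clauses(example):
        print(kind, "->", text)
```

Dependent clauses are not covered here, since the slide states that their detection varies across clause types.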

7 Experiment setup. Used the CW100 corpus: XCES-encoded to word level; sentence-delimited, tokenized, manually lemmatized and MSD-annotated. 200 randomly selected sentences: 100 for development and 100 for testing. Utilized the CroTag tagger; the NooJ input format allows external annotation. Created three systems: (1) no preprocessing; (2) tagging input sentences with CroTag (~85% accuracy); (3) using the manually assigned tags from CW100. Measures: recall, precision, F1-measure.
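The sampling step (200 random sentences, half for development and half for testing) could be reproduced along the following lines. This is a sketch under assumptions: the function name, the fixed seed, and the representation of the corpus as a Python list are hypothetical, and the slide does not say how the authors actually drew the sample.

```python
import random


def split_dev_test(sentences, n_total=200, n_dev=100, seed=0):
    """Draw n_total sentences without replacement and split them.

    The seed is an assumption added only to make the sketch repeatable.
    """
    rng = random.Random(seed)
    sample = rng.sample(sentences, n_total)
    return sample[:n_dev], sample[n_dev:]


if __name__ == "__main__":
    # Stand-in corpus: sentence identifiers instead of real CW100 sentences.
    corpus = [f"sentence-{i}" for i in range(1000)]
    dev, test = split_dev_test(corpus)
    print(len(dev), len(test))
```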

8 Results. Scores for the three systems: the perfect tagging system is the top performer. Benefits of automatic tagging? Distribution of assigned types: main, objective, opposite, adverbial, attribute, ... Misclassifications: attributive and objective clauses are most commonly misclassified; data sparseness.

              No tagging   CroTag   Manual tagging
Recall        0.730        0.657    0.813
Precision     0.890        0.899    0.928
F1-measure    0.802        0.759    0.867
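The F1-measure on this slide is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the other two rows. The snippet below is a small check, not part of the original system; the dictionary and function names are made up for illustration, while the numbers are the ones reported on the slide.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) per system, as reported on the results slide.
REPORTED = {
    "No tagging": (0.730, 0.890),
    "CroTag": (0.657, 0.899),
    "Manual tagging": (0.813, 0.928),
}

if __name__ == "__main__":
    for system, (recall, precision) in REPORTED.items():
        print(f"{system}: F1 = {f1_score(precision, recall):.3f}")
```

Recomputed this way, the values agree with the reported F1 row to three decimal places.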

9 Conclusions and future work. The system scores well in terms of F1-measure. Open issues: verb coordination; dislocated nominal predicates; attribute clauses starting with a PP; complex insertion of dependent clauses. No real benefit from automatic MSD-tagging. Future work: resolving the issues; re-evaluation on a larger test set?; integration with a rule-based shallow parser.

10 Thank you for your attention. The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 248347. www.accurat-project.eu

