Michal Křen Institute of the Czech National Corpus Charles University, Prague SLAVICORP Warszawa, 22 November 2010 Accessing the.


1 Michal Křen michal.kren@ff.cuni.cz Institute of the Czech National Corpus Charles University, Prague SLAVICORP Warszawa, 22 November 2010 Accessing the Czech National Corpus

2 Outline of the talk
- The Czech National Corpus (CNC)
- Available CNC corpora
- Accessing the CNC
- Demonstration

3 The Czech National Corpus
- long-term project aiming (not only) at continuous mapping of contemporary Czech
- compilation, maintenance and provision of public access to various corpora:
  - synchronic / diachronic
  - written / spoken
  - monolingual / multilingual
  - balanced / not balanced
  - (large) general / specialised corpora
  - CNC-compiled / hosted corpora
- corpus hosting is a service provided by the ICNC to institutions that compile corpora but lack the capacity and/or appropriate know-how for:
  - technical processing (format conversion, unification, quality control etc.)
  - public release, server maintenance etc.

4 Available CNC corpora

Synchronic written corpora (the SYN-series)

corpus        size (# of words)   lemmatisation & tagging   description
SYN2010       100 mil.            yes                       balanced corpus, most of the texts from 2005 - 2009
SYN2009PUB    700 mil.            yes                       newspapers and magazines from 1995 - 2007
SYN2006PUB    300 mil.            yes                       newspapers and magazines from 1990 - 2004
SYN2005       100 mil.            yes                       balanced corpus, most of the texts from 2000 - 2004
SYN2000       100 mil.            yes                       balanced corpus, most of the texts from 1990 - 1999

- the size is given in words proper (excluding numbers and punctuation)
- the balanced SYN-series corpora cover consecutive time periods and aim to represent the written language of that period, with emphasis on variability of sources

General features
- disjoint, i.e. any document can be included in only one of them
- invariable entities: once published, identical queries always give identical results
- processing differences between the corpora: lemmatisation, tagging, segmentation etc.
- => super-corpus SYN: unification of all the SYN-series corpora, updated when needed, consistently re-processed with state-of-the-art versions of available tools; the total size of SYN will thus soon reach 1.3 billion words proper
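The 1.3 billion figure for SYN can be checked against the SYN-series sizes on this slide. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative and not part of any CNC tool. Adding the sizes gives the size of the union only because the corpora are disjoint.

```python
# Sizes in millions of words proper, copied from the SYN-series table on this slide.
SYN_SERIES_MIL = {
    "SYN2010": 100,
    "SYN2009PUB": 700,
    "SYN2006PUB": 300,
    "SYN2005": 100,
    "SYN2000": 100,
}


def total_size_mil(sizes):
    """Sum the component sizes (hypothetical helper).

    Plain addition is valid here only because the SYN-series corpora are
    disjoint: no document is counted twice.
    """
    return sum(sizes.values())


total = total_size_mil(SYN_SERIES_MIL)
print(f"SYN total: {total} mil. words proper = {total / 1000} billion")
```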

5 Available CNC corpora

Diachronic corpora

corpus        size (# of words)   lemmatisation & tagging   description
DIAKORP       1 600 000           not yet                   corpus of old Czech texts (14th - 19th century)
DOTKO         12 000 000          no                        corpus of Lower Sorbian texts (mostly 1848 - 1933)

Synchronic written corpora continued - specialised and hosted corpora

corpus        size (# of words)   lemmatisation & tagging   description
ORWELL        80 000              manual                    Orwell's "1984", manually annotated
FSC2000       100 000 000         manual / no               modified SYN2000, base of the Frequency Dictionary of Czech
KSK-DOPISY    800 000             no                        transcriptions of handwritten correspondence from 1990 - 2004
LINK          1 900 000           yes                       corpus of academic texts from the linguistic domain

Synchronic spoken corpora

corpus        size (# of words)   lemmatisation & tagging   description
ORAL2008      1 000 000           not yet                   balanced corpus of informal spoken Czech from 2002 - 2007
ORAL2006      1 000 000           not yet                   corpus of informal spoken Czech from 2002 - 2006
SCHOLA        790 000             no                        transcriptions of school lessons from 2005 - 2008
BMK           490 000             no                        informal / semi-formal spoken Brno Czech from 1994 - 1999
PMK           675 000             manual                    informal / semi-formal spoken Prague Czech from 1988 - 1996

6 InterCorp
- aims at building a large parallel synchronic corpus covering a number of languages: bg da de en es fi fr hr hu it lt lv nl no pl pt ro ru sk sl sr sv
- Czech is the pivot language
- mostly fiction with manually corrected alignment
- supplemented by automatically aligned political commentaries published by Project Syndicate (de en es fr ru); more sources in the future (Presseurop.eu)
- lemmatisation and/or tagging where possible: bg de en es fr hu it lt nl no pl ru sk
- incremental (not invariable): its size and the number of languages are growing
  - currently searchable: 49 million foreign-language words in aligned segments
  - another 19 million words prepared for publication

The CNC tasks
- project administration, central coordination and funding
- central data storage, standardisation, data processing, quality assurance
- support to the coordinators of individual languages (manuals, tutorials etc.); the coordinators choose and supervise their own collaborators (mostly students)
- development of special software, mainly the search interface (Park) and the central database including text alignment and administration tools (InterText)

Credits: Alexandr Rosen, Michal Štourač, Martin Vavřín, Pavel Vondřička et al.
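With Czech as the pivot, every foreign-language text is aligned to the Czech version, so two foreign languages can be paired through the Czech segments they share. The following is a minimal sketch of that idea under stated assumptions: the segment ids, the sample sentences and the function name are invented for illustration and do not reflect the actual InterCorp data format or the InterText tools.

```python
# Hypothetical toy data: for each foreign language, a mapping from the id of a
# Czech (pivot) segment to the aligned foreign-language segment.
ALIGNED_TO_CZECH = {
    "en": {"cs:1": "Good morning.", "cs:2": "Thank you."},
    "de": {"cs:1": "Guten Morgen.", "cs:3": "Auf Wiedersehen."},
}


def pair_via_pivot(alignments, lang_a, lang_b):
    """Pair segments of two languages through the pivot segments they share.

    Returns a list of (pivot_id, segment_a, segment_b) tuples, sorted by
    pivot id. Pivot segments aligned to only one of the languages are skipped.
    """
    a, b = alignments[lang_a], alignments[lang_b]
    shared = sorted(a.keys() & b.keys())
    return [(pid, a[pid], b[pid]) for pid in shared]


pairs = pair_via_pivot(ALIGNED_TO_CZECH, "en", "de")
for pivot_id, en_seg, de_seg in pairs:
    print(pivot_id, "|", en_seg, "|", de_seg)
```

The design consequence is that adding a new language requires only one alignment (to Czech) rather than one per existing language.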

7 Accessing the CNC

server
  Manatee (Pavel Rychlý, FI MU Brno)
    + fast, powerful query language, GNU GPL
    - no documentation

monolingual clients
  Bonito 1 (Pavel Rychlý, FI MU Brno): stand-alone application, Tcl/Tk
    + various functions, very popular, GNU GPL
    - old architecture: requires installation, uses port 5016, no Unicode support
  Bonito 2 or The Sketch Engine (Pavel Rychlý, FI MU Brno): web-based, Python
    + does not require installation, http, supports Unicode
    - lacks some functionality, confusing interface, unclear licensing

multilingual client
  Park (Michal Štourač, ICNC): web-based, Python
    + does not require installation, http, supports Unicode
    - being developed, lacks a lot of functionality
    ( => non-parallel versions of the InterCorp texts are also made accessible via Bonito 2)
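The slide does not name Manatee's query language; it is a dialect of CQL (Corpus Query Language), in which each token is written as a bracketed set of attribute conditions. The following is a minimal sketch of how such a query string could be assembled. The helper functions are hypothetical, and the attribute names (`lemma`, `tag`) and the tag pattern are assumptions that depend on how a given corpus is annotated. The sketch does not escape quotes inside values.

```python
def cql_token(**attrs):
    """Build one CQL token expression, e.g. [lemma="pes" & tag="N.*"].

    Hypothetical helper: attribute names are passed through unchanged and
    values are treated as regular expressions by the query engine.
    """
    if not attrs:
        return "[]"  # an empty bracket pair matches any token
    conds = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conds}]"


def cql_sequence(*tokens):
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# Assumed example: a token whose tag starts with "A", followed by the lemma "korpus".
query = cql_sequence(cql_token(tag="A.*"), cql_token(lemma="korpus"))
print(query)  # [tag="A.*"] [lemma="korpus"]
```

The resulting string is what a user would type into the query box of a client such as Bonito; the sketch itself does not contact any server.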

8 Thank you!

