1 CKL --- Center for Computational Linguistics
Project MŠMT LC536 (LC05)
Univerzita Karlova v Praze, ÚFAL MFF
Západočeská univerzita Plzeň, KKY FAV
Masarykova Univerzita Brno, FI
Ústav pro jazyk český AV ČR Praha

2 Jan 31, 2011, ÚFAL MFF UK Center for Computational Linguistics (LC536)
Center’s Advisory Board Meeting
MFF UK, Malostranské nám. 25, Room S1, 4th floor
10:00 Introduction to the Center, history, results (Jan Hajic)
10:25 Charles University research and results (Jan Hajic)
10:40 Break
11:00 Institute for Czech Language research and results (Karel Oliva)
11:15 Masaryk University research and results (Karel Pala)
11:30 University of West Bohemia research and results (Pavel Ircing)

3 The Center
Goals:
– Research in all areas of computational linguistics and speech
– Close cooperation in speech and language
– Create annotated data
– Algorithms and SW tools for NL analysis and generation
– Create and integrate lexical resources

4 History of the Center
Former Center for Computational Linguistics (program MŠMT LN)
– UK, ÚJČ, ZČU: fundamental research type (B)
Now: Center for Computational Linguistics
– (again) fundamental research, MŠMT LC
– Masaryk University in Brno added, now 4 sites

5 The Center: some figures
Budget and timeframe
– 2.9 mil. €, [-2011] (6 yrs + 9 mos)
Staffing (2010):
– 1 PI (professor)
– 7 Co-PIs and key persons (full/assoc. prof.)
– 11 postdocs (Ph.D.)
  • 9 of them graduated with CKL support
– 24 graduate students
Reduced to about 2/3 for 2011

6 The sites (1)
UK Praha (ÚFAL MFF / Charles University)
– Formal language theory and algorithms
– SW tools for NLU / NLG
– Raw and annotated data (incl. parallel)
ZČU Plzeň, KKY FAV (University of West Bohemia in Pilsen)
– Speech recognition and TTS
– Data collection and annotation

7 The sites (2)
MU Brno, FI, NLP lab (Masaryk University)
– Lexical issues
  • Lexical databases, incl. SW
ÚJČ AV ČR (Institute of the Czech Language, Academy of Sciences of the CR)
– Digitization of historical data
– Lexical databases

8 Start of work, after some “gap”
– Apr. 1, 2005 – three-month vacuum
– [Got back the name…]
– Reduced budget for 2005 (300k €)
  • Durable equipment / future computing cluster
– Cooperation:
  • EU grant proposals
  • Continuing work on Malach (U.S.)
  • Start of the PIRE NSF project (JHU, Brown Univ.)

9 First full year
– Prague Dependency Treebank v2.0 finished (published at LDC)
– Speech reconstruction project (UK, specification with PIRE/JHU)
– Lexical issues (UK, MU, ÚJČ)
– Speech (ASR, TTS - ZČU)
– IR – CLEF test collection, CLEF shared task, 1st part
– Digitization of historical material (ÚJČ)
– Start of EU Integrated project “Companions”: UK, ZČU
– More international cooperation: EU, USA (JHU, Brown, Univ. of Pennsylvania)
– Organization of Treebanks and Linguistic Theories, Dec (UK)
– 40 “results” in the government database (“RIV”)

10 Mid-project
– Lexical resources, new Czech language lexical database (MU+ÚJČ)
– Added more students for English work, translation
  • English annotation specification, annotation (ZČU, UK)
– Integration of ASR and TTS with NLU/NLG (UK, ZČU)
  • In the “Companions” project
– SW tools for analysis and generation
  • Speech, language (UK, MU, ZČU)
– International collaboration
  • EU (3 projects, 6th FP: UK, UK+ZČU), USA (UK, UK+ZČU)
– Local organisation of ACL 2007 and EMNLP 2007
  • Still (2011) holds the attendance record (~1100 participants)
– 66 results in “RIV” (16 journals, 39 in-proc., 5 SW/data etc.)

11 Slightly modified goals (stress on MT)
– Lexical resources (MU, UK, ÚJČ)
  • SW tools
– Semantics
  • Detection of plagiarism (MU)
  • NLU (UK, MU), NLG (UK)
– New algorithms for ASR
  • Prosody, language modeling, speech reconstruction
– Data acquisition, annotation, corpus tools
– Research (incl. data annotation) for machine translation
  • The TectoMT SW and data platform
– Theoretical formal linguistics, language usage
Results (RIV): 64: 13 journal art., 32 in-proc., 5 books, 5 SW tools/data resources etc.

12 Should have been the last year of CKL…
– Application for extension for
  • Granted for 2010
– Research: English data, MT, ASR, Dialog
  • Work on the parallel Czech-English treebank (PTB)
  • Companions project: integration work
– Tight cooperation between UK and ZČU
  • PIRE project – workshops, students from US at UK
  • EuroMatrix EU project on MT extended (-2012)
– Organization of the CoNLL 2009 shared task
– Organization of session at FET 2009 (EU conference)
– Results: 62, journals: 8, in-proc.: 42, 3 books etc.

13 Last fully-funded year: ext. to 2011 granted in Nov.
– Continuation of research along the same lines
  • Wrap-up in data annotation: PCEDT, PDTSx
  • Departures of people due to uncertainty
– International cooperation:
  • Companions project finished (Nov. 2010)
  • PIRE continuing towards 2011, EuroMatrixPlus renewed (UK)
  • New projects in 2010:
    – Univ. of Pennsylvania – discourse representation, annotation (UK)
    – Khresmoi (EU IP) – medical IR and IE, UK
    – Faust (STREP, machine translation, UK)
    – META-NET network of excellence in MT / data sharing
  • Chairing the ACL 2010 conference (Uppsala, Sweden)
– Results (prelim.): ~60 (12 journal articles, ~40 in-proc.)

14 Quantitative Summary of Results
RIV (2010 pending)
– 274 records (+ ~60 in 2010)
Mostly papers in proceedings of conferences and workshops
– ACL, EACL, NAACL, Coling, CoNLL; workshops
– > 95% international, > 85% abroad
Some journal articles
– LNCS, IEEE Transactions, LRE, Czech ling. journals (PBML, SaS – now in WoS)
Software and data
– Mostly “open source”; training, shared task (evaluation)

15 Most valued publications
Papers
– Semi-supervised POS tagging (EACL 2009)
  • Best results in POS tagging so far, incl. English
  • Now taggers available in 5 languages
– Extension of HVS Semantic Parser by Allowing Left-Right Branching (ICASSP 2008)
  • New result, drawing from S. Young’s work
– Large-scale Semantic Networks: Annotation and Evaluation
  • NAACL 2009; in cooperation with Google Research (Zurich, K. Hall)
– CoNLL 2009 Shared Task, CoNLL 2009
  • Overall task and system description
Book
– Valenční slovník českých sloves (Valency Lexicon of Czech Verbs, Karolinum Press)
  • Electronic version available

16 Most valued data
Corpora (language databases, publicly available)
– Prague Dependency Treebank 2.0, Linguistic Data Consortium 2006
– Prague Czech-English Dependency Treebank, to appear in 2011
  • Penn Treebank & translation to Czech, with semantic annotation (~PDT style)
– Czech Wordnet 1.0 (ELRA, 2008)
– Sign Language, Audiovisual (ELRA, 2008)
Test / shared task collections
– CLEF 2006, 2007
  • Multilingual cross-language search competitions
– Machine Translation Open Competition – EuroMatrix/Plus
  • Czech-English, German, French, Italian, Hungarian, Spanish
– CoNLL Shared Task 2007, 2009
  • Dep. parsing, semantic role labeling (unified for 7 languages)

17 Most valued SW tools
Software
– Corpus manager (client/server) Bonito/Manatee
  • Worldwide use: ČNK, SNK; Hu, Hr, GB
– Word Sketch Engine
  • Commercial use (Lexical Computing)
– ComPOST
  • State-of-the-art POS tagger (Cz, En, Dutch, Swedish, Icelandic)
– Syntactic dependency parser “MST” (Czech)
  • With Univ. of Pennsylvania
– Improved Czech ASR and Emotional TTS
  • Used in the Companions project
– NLG and Dialogue Manager w/knowledge base
  • Also for the Companions project
– The TectoMT SW and data handling platform
  • MT, dialogue systems (now any NLU/NLG processing -> “Treex”)

18 The Center provided…
Material benefits
– 3/4 of budget: personnel (mainly graduate students)
– Generous travel money
– Small equipment
– Durable equipment – clusters ( CPUs)
  • Only in 2005/6 – need for renewal
– Small indirect costs (< 12%, contribution of inst.)
“Intangible” benefits
– (Sub)teams, even across institutions, flexible assignment of people to projects
– dissertations, one assoc. professor promotion

19 The Center had to work under certain “restrictions”
Employment of graduate students, postdocs, supervision of graduate students
– Now at all four sites (2009: 10/4/9/1)
  • Requirement: at least on site… → Check
Requirement: Participation of students (Bc./Mgr./Ph.D.)
– Total: 41 students → Check
– 7 nationalities
Students - after graduation - went to (e.g.)…
– Petr Němec (UK): TextKernel, Netherlands; Kiril Ribarov (UK): ČEZ
– Jan Romportl, Aleš Pražák: SpeechTech (spinoff, ZČU)
– Vladimír Kadlec (MU Brno): Acision (GB)
– Petr Pajas (UK): Google (Zurich)
– Václav Novák (UK): Ministry of Interior, then a small startup
– Former CKL (LN, 00-04): M. Čmejrek, J. Cuřín (UK): IBM Research (Yorktown, Prague)

20 “Restrictions” (cont.’d)
Requirement: integration into EU “research space”
9 EU projects, 6th and 7th FP
– All types: IP, STREP, NoE; SSA, Dig. Libraries
  • Companions (IP) - ZČU, UK
  • Khresmoi (IP) - UK
  • EuroMatrix, EuroMatrixPlus, Faust (STREP) - UK
  • Flarenet, META-NET (NoE) - UK
  • Clarin (SSA) - UK, MU, ÚJČ
  • KYOTO (Dig. Libraries) - MU
USA
– Malach (till 2007; UK, ZČU): USC, JHU, IBM, UMD
– PIRE: speech recognition and machine translation (UK, indirectly ZČU): JHU, Brown Univ.
– Discourse: Univ. of Pennsylvania
– Treebanking: Univ. of Colorado → Check

21 EU Project “Companions”
Goal
– Intelligent conversational companion
  • Over photographs (Cz), “how was your day” (En)
Technologies
– ASR, emotional TTS
– Natural language understanding, NL generation
– Naturalness of dialogue: “user studies” / “evaluation”
CKL
– UK/ZČU: ASR, TTS, NLU, NLG, Dialogue management

22 The Companions project

23 Companions: System Diagram

24 Other project demos

25 Semantic annotation (UK)
Některé kontury problému se však po oživení Havlovým projevem zdají být jasnější. (“However, after being revived by Havel’s speech, some contours of the problem seem clearer.”)

26 PDT 2.0: Annotation layers
„Byl by šel do lesa“ (“he’d go to the forest”)
Linked layers of annotation
Stand-off annotation
Scheme (RELAX NG)
z-layer

27 Speech reconstruction (UK, ZČU)
● Goal: “Translation”
SEM NEMOH SEM TO JIM DÁT TEN VOBRAZ
‘m couldn’t ‘m that them give the paintin’
Ten obraz jsem jim nemohl dát.
I could not give them the painting.
? Generation
● Annotation

28 Speech Reconstruction Annotation
Edited transcript
– All changes allowed
– Manual annotation
– Large data
  • Malach data
  • Companions proj. dialogues (> 100h)

29 Acoustic modeling of inter-word context (ZČU)
Usage: real-time closed captioning

30 Automatic translation Czech->sign language (ZČU)
– Two sign languages in the Czech Republic
  • Signed Czech
    – Artificial language, similar to Czech
  • Czech sign language
    – “Mother tongue” of the hearing-impaired
    – Used among those with hearing loss or difficulty
    – Dissimilar to Czech (or other NL):
      • Homonymy, but easy context-based disambiguation
      • Use of the space in front of the “speaker”
      • Facial expressions (for “intonation”)

