LR College Maribor: 9th ECESS meeting

1 LR College Maribor: 9th ECESS meeting
1. Goal of meeting
2. Status members of College
3. Interests and acceptance of associated members. Activities of Microsoft Portugal concerning LR
4. College Action List of VIII meeting
9th ECESS Meeting, College Language Resources, Maribor, 27 June 2007

2 LR College Maribor: 9th ECESS meeting
5. Status and further plans of partners
Pronunciation lexica (Pool Lex1-PL1; Pool Lex2-PL2). Validation. Distribution.
Acoustic data for TTS voices (Pool Voice1-PV1, Pool Voice2-PV2). Non-standard acoustic data. Validation. Distribution.
Text corpora (Pool Text1-PT1, Pool Text2-PT2).
Settling the specification for POS tagging
6. Action List of IX Meeting of LR

3 LR College Maribor: 9th ECESS meeting
1. Main Goals
Status and further plans of partners
Interests and acceptance of associated members
Settling the specification for POS tagging

4 LR College Maribor: 9th ECESS meeting
2. Status members of College
AMU University of Poznan (Coordinator Grażyna Demenko)
Siemens (Harald Höge)
Middle East Technical University, Ankara (Tolga Çiloğlu)
CAS (Jinhua Tao)
Associates and Observers:
Nokia (Imre Kiss)
ATR (Nick Campbell)
Microsoft Portugal (Daniela Braga)
CNRS Aix en Provence (Daniel Hirst)

5 LR College Maribor: 9th ECESS meeting
3. Interests and acceptance of associated Members and Observers
Activities of Microsoft Portugal concerning LR
Daniela Braga

6 LR College Maribor: 9th ECESS meeting
4. College Action of 8th Meeting
Nokia (Imre) to collect feedback from partners concerning non-standard lexica and acoustic databases. Result: text for TA.
Siemens finds references to PennTree tagging format
Siemens (Ute) to ask LC-STAR partners if they can make the LC-STAR LSPs available for ECESS partners
Siemens (Ute) to make a proposal about what POS covers
UAM Poznan to collect information from all partners for text corpora specs.

7 LR College Maribor: 9th ECESS meeting
5. Status of partners concerning standard lexicon/acoustic data
UMB: SI lexicon validated, SI baseline voice ready by 10/2007
UPC: CA lexicon validated, CA baseline voice TC-STAR compliant, but not validated
UAM: PL lexicon ready for validation (expected to be ready by 11/2007)
Siemens: UK lexicon (10/2007), UK baseline voice validated
Nokia: CN lexicon ready, baseline voice currently in validation
Exchange of standard LR around Oct-Dec 2007, when all listed resources are validated

8 LR College Maribor: 9th ECESS meeting

9 Design Principles of the Acoustic Corpora
Size of corpus: 10 h speech per baseline speaker per language
'Baseline Text Corpus' is composed of the following corpora:
Transcribed speech: 45 000 words
Written text (novels and short stories with short sentences): 27 000 words
Selected phrases (frequent phrases, triphone sentences, mimic sentences): 18 000 words
2 pools for lexica and acoustic data (TC/LC-STAR and minimum requirements to be worked out in the project).
Minimal requirements acoustic data. Coordinated by University of Munich
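
As a quick check of the figures on this slide, the three components of the Baseline Text Corpus can be summed and expressed as shares. This is only an illustrative sketch: the dictionary keys and the function name are invented here, and only the word counts come from the slide.

```python
# Word counts per component, as listed on the slide.
# The key names are illustrative labels, not official corpus names.
BASELINE_TEXT_CORPUS = {
    "transcribed_speech": 45_000,
    "written_text": 27_000,
    "selected_phrases": 18_000,
}


def composition(parts):
    """Return the total word count and each component's share of it."""
    total = sum(parts.values())
    shares = {name: count / total for name, count in parts.items()}
    return total, shares


total, shares = composition(BASELINE_TEXT_CORPUS)
print(total)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The total comes to 90 000 words, which agrees with the 90K-word TC-STAR text corpus mentioned in the text corpus specifications.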

10 LR College Maribor: 9th ECESS meeting
Text corpus specifications (for POS tagging)
Size of corpus: expected size of text data is 100K tokens minimum, 100% manually checked; the rest (500K-1M) can be done automatically
Domains:
Mandatory: 20K should come from spoken transliterations
Preferred: in line with the TC-STAR text corpora (in line with acoustic data creation)
TC-STAR text corpus as basis for POS tagging (90K words)
LC-STAR tag set, or comparable, but tag set in lexicon and tagged text corpus must match
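
The size and tag-set requirements above could be checked mechanically before a corpus is submitted for validation. The following is a minimal sketch under stated assumptions: the function name, the input format, and the example tags are hypothetical, and "must match" is read here as "every tag used in the corpus also occurs in the lexicon", which is one possible interpretation of the slide.

```python
MIN_MANUAL_TOKENS = 100_000  # manually checked tokens (figure from the slide)
MIN_SPOKEN_TOKENS = 20_000   # tokens from spoken transliterations (figure from the slide)


def check_text_corpus(manual_tokens, spoken_tokens, lexicon_tags, corpus_tags):
    """Return a list of problems; an empty list means all checks pass."""
    problems = []
    if manual_tokens < MIN_MANUAL_TOKENS:
        problems.append(
            f"only {manual_tokens} manually checked tokens, need {MIN_MANUAL_TOKENS}"
        )
    if spoken_tokens < MIN_SPOKEN_TOKENS:
        problems.append(
            f"only {spoken_tokens} spoken tokens, need {MIN_SPOKEN_TOKENS}"
        )
    # Assumed reading of "must match": corpus tags are a subset of lexicon tags.
    unknown = sorted(set(corpus_tags) - set(lexicon_tags))
    if unknown:
        problems.append(
            "tags used in corpus but missing from lexicon: " + ", ".join(unknown)
        )
    return problems


# Hypothetical example data; the tag names are placeholders, not the LC-STAR tag set.
lexicon_tags = {"NOUN", "VERB", "ADJ"}
corpus_tags = ["NOUN", "VERB", "ADV", "NOUN"]
report = check_text_corpus(120_000, 15_000, lexicon_tags, corpus_tags)
for line in report:
    print(line)
```

With this example data the report flags the shortfall in spoken tokens and the tag that appears in the corpus but not in the lexicon.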

11 LR College Maribor: 9th ECESS meeting
Discussion POS tagging
Size of text, domains
Tokenization problems
POS tagging sets
Format of POS tagging
Validation

12 LR College Maribor: 9th ECESS meeting
6. Action List of IX Meeting of LR
Finalizing specifications for LR, voice database: non-standard PV2 Pool
Finalizing specifications for Text Corpora POS: PT1, PT2 Pool
Lexicon: PL1, PL2 Pool
Final documentation, reports of validation: end of 2007 (internal ECESS pages)

