Digital Italian An overview of Italian corpora
A linguistic corpus: a body of texts or transcripts collected for linguistic purposes; computerized, representative of the variety studied, balanced, annotated.
Annotation: linguistic annotation can be useful or restrictive; extra-linguistic annotation is useful for sociolinguistic research.
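To make the two kinds of annotation concrete, here is a minimal sketch in Python. It is not the format of any corpus named in this presentation: the class names, the field names (lemma, pos, region, medium) and the two sample utterances are all hypothetical, chosen only for illustration. Linguistic annotation is attached to each token, while extra-linguistic annotation is attached to the whole text and can be used to select a sub-corpus.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # linguistic annotation: dictionary form (hypothetical field)
    pos: str    # linguistic annotation: part of speech (hypothetical field)


@dataclass
class Text:
    tokens: list[Token]
    # Extra-linguistic annotation: facts about the speaker or the situation.
    metadata: dict[str, str] = field(default_factory=dict)


def lemma_counts(corpus: list[Text], **criteria: str) -> Counter:
    """Count lemmas in the texts whose metadata match every criterion."""
    counts: Counter = Counter()
    for text in corpus:
        if all(text.metadata.get(key) == value for key, value in criteria.items()):
            counts.update(token.lemma for token in text.tokens)
    return counts


# Two invented utterances, used only to exercise the sketch.
corpus = [
    Text(
        [Token("vado", "andare", "VERB"), Token("a", "a", "ADP"), Token("casa", "casa", "NOUN")],
        {"region": "Toscana", "medium": "spoken"},
    ),
    Text(
        [Token("andiamo", "andare", "VERB"), Token("via", "via", "ADV")],
        {"region": "Lazio", "medium": "spoken"},
    ),
]

tuscan = lemma_counts(corpus, region="Toscana")  # sub-corpus chosen by metadata
spoken = lemma_counts(corpus, medium="spoken")   # both texts match
```

Because the lemma is stored with each token, "vado" and "andiamo" are counted together under "andare"; because region and medium are stored with each text, the same count can be restricted to one group of speakers, which is the kind of query sociolinguistic research relies on.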
Italian corpora: general or specialized; written or spoken; diachronic or synchronic.
General corpora, written Italian: Corpus e lessico di frequenza dell'italiano scritto (COLFIS); Corpus di riferimento dell'italiano scritto / Corpus dinamico dell'italiano scritto (CORIS/CODIS).
COLFIS - structure
COLFIS (over three and a half million words): newspapers, periodicals, books.
Newspapers: Il Corriere della Sera, La Repubblica, La Stampa. Sections: economy, news of local interest, society, crime news, internal / external affairs, science, show biz and sports.
Periodicals: arts, science and technology, cars and boats, children and youngsters, home and hobby, women's magazines, photo love story, general information, society, radio and television, sport, travel and ecology, other.
Books: arts, children, SF, detective and spy stories, hobby and travel, classics, modern narrative, romance, essays, natural and exact sciences, human and social sciences, theatre and poetry, other.
CORIS/CODIS – structure
CORIS / CODIS (one hundred million words): Press, Fiction, Academic Prose, Legal and Administrative Prose, Miscellanea, Ephemera.
Press: newspaper, periodical, supplement. National, local / specialist, non-specialist / connotated, non-connotated.
Fiction: novels, short stories. Italian, foreign, for adults, for children, crime, adventure, SF, women's literature.
Academic Prose: human sciences, natural sciences, physics, experimental sciences. Books, reviews, scientific, popular history, philosophy, arts, literary criticism, law, economy, biology, etc.
Legal and Administrative Prose: legal, bureaucratic, administrative. Books, reviews.
Miscellanea: books on religion, travel, cookery, hobbies, etc.
Ephemera: letters, leaflets, instructions. Private, public / printed form, electronic form.
General corpora, spoken Italian: Lessico di frequenza dell'italiano parlato (LIP) -> Banca dati dell'italiano parlato (BADIP); Archivio delle varietà dell'italiano parlato (AVIP); LABLITA.
Spoken and written Italian: Corpora e lessici dell'italiano parlato e scritto (CLIPS)
CLIPS (the spoken corpus): radio and television speech, field recordings, readings, telephone speech.
Radio and television speech: entertainment, informative broadcasts, cultural and educational broadcasts, commercials.
Field recordings: map task dialogues and spot-the-difference games.
Readings: by the speakers themselves or by professional dubbing actors.
Telephone speech: conversations between a fake tour operator and three hundred people.
Specialized corpora: Corpus di italiano televisivo (CIT); La Repubblica.
CIT – structure
CIT: Current affairs. Entertainment (games, talk shows, variety shows). Commercials. Sports news. Newscast.
Subdivisions: Commentaries. Play-by-play. Studio broadcast. On-field broadcast. Text. Text. Slogans. Studio broadcast. On-field broadcast. Text. Headlines. Studio broadcast. On-field broadcast.
Corpus di italiano televisivo
La Repubblica – structure
La Repubblica, classified by year, genre and topic.
Genre: news, comment.
Topic: religion, culture, economics, education, news, politics, science, society, sport, weather, unclassified.
Thank you! Anne-Marie OBRETIN, MRes in European Languages and Cultures, University of Exeter