The Games Corpus Design, implementation and annotation Agustín Gravano Spoken Language Processing Group Columbia University.


2 The Games Corpus Design, implementation and annotation Agustín Gravano agus@cs.columbia.edu Spoken Language Processing Group Columbia University

3 "The Games Corpus" - Agustín Gravano - Columbia University. The Games Corpus: 1. Design and Implementation 2. Annotation

4 The Games Corpus: 1. Design and Implementation 2. Annotation

5 Experiment Design. Goal: study the relation between the down-stepped contour and information status, syntactic position, and discourse position. Spontaneous speech. Both monologue and dialogue.

6 Experiment Design. Three computer games. Two players, each on a different computer. They collaborate to perform a common task. Totally unrestricted speech.

7 Cards Game #1. Player 1 (Describer), Player 2 (Searcher). Short monologues. Vary frequency and order of occurrence of objects on the cards.

8 Cards Game #2. Player 1 (Describer), Player 2 (Searcher). Dialogue. Vary frequency and order of occurrence of objects on the cards.

9 Objects Game. Player 1 (Describer), Player 2 (Searcher). Dialogue. Vary target and surrounding objects (subject and object position).

10 Games Session. Repeat 3 times: Cards Game #1, Cards Game #2. Short break (optional). Repeat 3 times: Objects Game. Each subject participated in 2 sessions. 12 sessions.

11 Subjects. Postings: Columbia’s webpage for temporary job ads; Craigslist (http://www.craigslist.org), category: Gigs → Event gigs. Problem: people are unreliable. ~50% did not show up, or cancelled on short notice.

12 Subjects. Possible solutions: Give precise instructions to e-mail ALL required info (name, native speaker?, hearing impairments?, etc.). Ask for a phone number. Call them and explain why it is so important for us that they show up (or cancel with adequate notice). Increase the pay after each session. Example: $5, $10, $15 instead of $10, $10, $10.

13 Recording. Sound-proof booth. 2 subjects + 1 or 2 confederates. Head-mounted mics. Digital Audio Tape (DAT): one channel per speaker. Wav files: one mono file per speaker. Sample rate: 48000 Hz, downsampled to 16000 Hz (but kept original files!). ~20 hours of speech → 2.8 GB (16k).

14 Logs. Log everything the subjects do to a text file. Example:
17:03:55:234 BEGIN_EXECUTION
17:04:04:868 NEXT_TURN
17:04:31:837 RESULTS 97 points awarded.
17:04:38:426 NEXT_TURN
17:05:03:873 RESULTS 92 points awarded.
...
Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
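A minimal sketch of how such a log could be split into tasks. It assumes that fields are separated by whitespace (the separator is not recoverable from the slide) and that one task runs from a NEXT_TURN event to the following RESULTS event; both are assumptions, and the function names are hypothetical.

```python
from datetime import timedelta

def parse_timestamp(stamp):
    """Parse 'HH:MM:SS:mmm' into a timedelta since midnight."""
    hours, minutes, seconds, millis = (int(part) for part in stamp.split(":"))
    return timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)

def parse_log(lines):
    """Yield (time, event, detail) from whitespace-separated log lines."""
    for line in lines:
        fields = line.strip().split(None, 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        detail = fields[2] if len(fields) > 2 else ""
        yield parse_timestamp(fields[0]), fields[1], detail

def split_into_tasks(events):
    """Return (start, end) spans, one per NEXT_TURN ... RESULTS pair (assumed task unit)."""
    spans, start = [], None
    for time, event, _detail in events:
        if event == "NEXT_TURN":
            start = time
        elif event == "RESULTS" and start is not None:
            spans.append((start, time))
            start = None
    return spans

SAMPLE_LOG = """\
17:03:55:234 BEGIN_EXECUTION
17:04:04:868 NEXT_TURN
17:04:31:837 RESULTS 97 points awarded.
17:04:38:426 NEXT_TURN
17:05:03:873 RESULTS 92 points awarded.
"""

tasks = split_into_tasks(parse_log(SAMPLE_LOG.splitlines()))
for start, end in tasks:
    print(start, end, (end - start).total_seconds())
```

The resulting spans are wall-clock times; mapping them onto positions in the audio files would additionally require knowing when the recording started, which the slide does not specify.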

15 The Games Corpus: 1. Design and Implementation 2. Annotation

16 Speech Processing Tools. Praat (http://www.praat.org), WaveSurfer (http://www.speech.kth.se/wavesurfer), Transcriber (http://trans.sourceforge.net).

17 Orthographic Tier - Method 1

18 Orthographic Tier - Method 1. Problems: very stressful; time-consuming. Separate transcription from alignment.

19 Orthographic Tier - Method 2. 1. Transcribe chunks using a web interface.

20 Orthographic Tier - Method 2. 1. Transcribe chunks using a web interface. 2. Align each chunk automatically. 3. Concatenate all chunks. 4. Correct the alignment by hand using Praat, WaveSurfer or similar.

21 Orthographic Tier - Method 2. Advantages: the transcription task is very comfortable; most of the alignment task is done automatically; only fine-grained hand corrections are needed. Problems: overhead (chunking, automatic alignment, concatenation); error-prone, since it is easy for humans to overlook errors in the automatic alignment.
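Step 3 of Method 2 (concatenating the aligned chunks) could look roughly like the sketch below. It assumes each chunk carries its start offset within the full recording and a list of word intervals timed relative to the chunk; this data layout, the function names, and the sample words are hypothetical. The overlap check is one simple way to flag the kind of alignment error that is easy to overlook by eye.

```python
def concatenate_chunks(chunks):
    """Shift chunk-relative word intervals to absolute time and merge them.

    chunks: list of (chunk_start_seconds, [(start, end, word), ...]).
    """
    merged = []
    for offset, intervals in sorted(chunks, key=lambda chunk: chunk[0]):
        for start, end, word in intervals:
            merged.append((round(offset + start, 3), round(offset + end, 3), word))
    return merged

def find_overlaps(intervals):
    """Return consecutive pairs where the second interval starts before the first ends."""
    return [(a, b) for a, b in zip(intervals, intervals[1:]) if b[0] < a[1]]

# Made-up data for illustration only.
chunks = [
    (5.0, [(0.05, 0.50, "the"), (0.50, 1.10, "lion")]),
    (0.0, [(0.10, 0.42, "okay"), (0.42, 0.80, "next")]),
]
words = concatenate_chunks(chunks)
print(words)
print(find_overlaps(words))
```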

22 Orthographic Tier - Method 3. 1. Transcribe the whole file, using a regular audio player (e.g., Windows Media Player) and a regular plain-text editor (e.g., Notepad). 2. Use WaveSurfer to align the words (“Load text labels” function). Check out: spectrogram settings, customizable shortcuts.

23 Orthographic Tier. Transcription guidelines: capital letters; abbreviations; disfluencies; mmhm, uhhuh, gotcha, etc. Alignment guidelines: boundaries. http://www.cs.columbia.edu/~agus/games username/password = speech/lions

24 Too many cooks… Concurrency problem. File-locking webpage: annotators lock a file before working on it, and release it when done.

25 Annotation: Cue Words. okay, mmhm, uhhuh, right, etc. Acknowledgment, Backchannel, Segment Beginning, Segment End, etc. Developed an ad-hoc application in Java. Bad idea!!! Development time was too long. Instead, use Praat (or another general-purpose tool). For simple, specific tasks, Praat is not difficult to learn. Create a file with empty points at the midpoint of the words that need to be labeled. Annotators only label those words, safely ignoring the rest.
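The "empty points at the midpoint" idea can be sketched as follows, assuming the word tier is available as (start, end, word) tuples. The cue-word set contains only the four words named on the slide (its list ends in "etc."), the function name is hypothetical, and writing the points out in Praat's file format is deliberately left out.

```python
# Only the words listed on the slide; the full inventory is not given there.
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}

def cue_word_points(word_intervals, cue_words=CUE_WORDS):
    """Return (time, label) points with empty labels at each cue word's midpoint."""
    points = []
    for start, end, word in word_intervals:
        if word.lower() in cue_words:
            points.append((round((start + end) / 2, 3), ""))
    return points

# Made-up word tier for illustration only.
word_tier = [
    (0.00, 0.40, "okay"),
    (0.40, 0.90, "the"),
    (0.90, 1.30, "lion"),
    (1.30, 1.70, "right"),
]
points = cue_word_points(word_tier)
print(points)
```

Annotators would then fill in the empty labels for these points only, leaving every other word untouched.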

26 Other Annotations. Turn switches: smooth switches, interruptions, backchannels, etc. The labeler received a Praat file with empty turns. Prosody: ToBI Labeling Conventions (Tones and Break Indices). Questions: identification, form and function.

27 Guidelines for Guidelines. Web based (password protected). Highlight recent changes. Avoid long lists: categorize, trees.

28 Files. games/data/session_NN/sNN.GAME.P.Y.ext
NN = 01..12
GAME = {cards, objects}
P = 0..3 if GAME=cards, 0..1 if GAME=objects
Y = {A, B}
ext = {wav, words, tones, breaks, misc, turns, …}

29 Files. Examples:
games/data/session_08/
    s08.cards.3.B.wav
    s08.cards.3.B.words
    s08.cards.3.B.misc
    …
    s08.objects.1.A.wav
    s08.objects.1.A.words
    s08.objects.1.A.misc
    …
games/data/session_11/…
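A naming scheme like this is easy to validate mechanically. The sketch below parses a file name of the form sNN.GAME.P.Y.ext and checks the ranges given on the slide; the function name and the returned dictionary keys are hypothetical, and the extension is not checked because the slide's list of extensions is open-ended.

```python
import re

FILENAME_RE = re.compile(
    r"s(?P<session>\d{2})\.(?P<game>cards|objects)\.(?P<p>\d)\.(?P<y>[AB])\.(?P<ext>\w+)"
)
# Largest allowed P per game, as stated on the slide.
MAX_P = {"cards": 3, "objects": 1}

def parse_corpus_filename(name):
    """Split 'sNN.GAME.P.Y.ext' into its fields, raising ValueError if invalid."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a corpus file name: {name!r}")
    session, game, p = int(match["session"]), match["game"], int(match["p"])
    if not 1 <= session <= 12:
        raise ValueError(f"session out of range 01..12: {name!r}")
    if p > MAX_P[game]:
        raise ValueError(f"P out of range for {game}: {name!r}")
    return {"session": session, "game": game, "p": p,
            "y": match["y"], "ext": match["ext"]}

print(parse_corpus_filename("s08.cards.3.B.wav"))
```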

30 Files Format. All files (except *.wav) are saved as plain text, in the WaveSurfer format: "Start End Value" for interval tiers, "Time Value" for point tiers. Advantages: human-readable; very easy to process. Problems: consistency; rounding.
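To illustrate why the plain-text format is easy to process, here is a minimal sketch of a reader for interval tiers plus a basic consistency check. It assumes one "Start End Value" record per line with whitespace-separated fields and times in seconds; the function names and the tolerance used to absorb rounding differences are hypothetical choices.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) tuples."""
    intervals = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end, *value = line.split(None, 2)
        intervals.append((float(start), float(end), value[0].strip() if value else ""))
    return intervals

def check_interval_tier(intervals, tolerance=0.0005):
    """Report intervals with negative duration or that overlap their predecessor."""
    problems = []
    previous_end = None
    for start, end, value in intervals:
        if end < start:
            problems.append(f"negative duration: {start} {end} {value!r}")
        if previous_end is not None and start < previous_end - tolerance:
            problems.append(f"overlaps previous interval: {start} {end} {value!r}")
        previous_end = end
    return problems

# Made-up tier contents for illustration only.
SAMPLE_TIER = "0.000 0.400 okay\n0.400 0.900 the lion\n0.900 1.200\n"
tier = parse_interval_tier(SAMPLE_TIER)
print(tier)
print(check_interval_tier(tier))
```

The tolerance exists because boundaries written with limited decimal precision may differ slightly between tiers, which is one form the rounding problem can take.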

31 Files Format. Other formats. XML: general-purpose mark-up language. … Solves problems like consistency and rounding. Not human-readable, harder to process. Praat: not human-readable, hard to process. Also has the consistency problem.

32 Scripts. So far, we have needed dozens of Perl scripts. Examples: convert between Praat and WaveSurfer formats; create a Praat file with empty CW labels, turns, etc.; find typos, missing labels, and other errors; unify notation (e.g., “mm-hmm” → “mmhm”); check consistency of files; …
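The original scripts were written in Perl and are not shown in the slides. As an illustration only, two of the listed tasks (unifying notation and finding unexpected or missing labels) might be sketched in Python as below. The variant table contains just the one pair mentioned on the slide, and all function names are hypothetical.

```python
# Only the "mm-hmm" -> "mmhm" pair comes from the slide; extend as needed.
VARIANTS = {"mm-hmm": "mmhm"}

def unify_label(label, variants=VARIANTS):
    """Map a known spelling variant to its canonical form; leave other labels alone."""
    return variants.get(label.lower(), label)

def find_unknown_labels(labels, allowed):
    """Return the sorted set of non-empty labels that are not in the allowed inventory."""
    return sorted({label for label in labels if label and label not in allowed})

def find_missing_labels(labels):
    """Return the positions of empty labels."""
    return [index for index, label in enumerate(labels) if not label.strip()]

labels = ["mm-hmm", "okay", "", "rihgt", "Mm-hmm"]
unified = [unify_label(label) for label in labels]
print(unified)
print(find_unknown_labels(unified, allowed={"mmhm", "okay", "right"}))
print(find_missing_labels(unified))
```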

33 Back-up! Back up wav files only once (too heavy), in different places (DVD, 3+ computers). Back up everything else (plain text: light) periodically and automatically. Configure “cron” to make a backup copy every 8 hours.
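One way to set up the periodic part is a small script that archives everything except the wav files, run from cron. The sketch below is an assumption about how this could be done, not the setup actually used; the function name, archive naming, and the paths in the crontab comment are hypothetical.

```python
import tarfile
import time
from pathlib import Path

# Hypothetical crontab entry to run a script containing this function every
# 8 hours (at 00:00, 08:00 and 16:00):
#   0 */8 * * * python3 /path/to/backup_games.py

def backup_text_files(corpus_dir, backup_dir, skip_suffixes=(".wav",)):
    """Archive every file under corpus_dir except those with a skipped suffix.

    backup_dir should live outside corpus_dir so archives are not re-archived.
    """
    corpus_dir, backup_dir = Path(corpus_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = backup_dir / f"games-{stamp}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(corpus_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() not in skip_suffixes:
                archive.add(path, arcname=path.relative_to(corpus_dir).as_posix())
    return archive_path
```

Calling `backup_text_files("games", "/some/other/disk/backups")` would write one timestamped archive per run; the wav files are skipped because they are backed up once, separately.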

34 Timeline. Orthographic tier first! [Timeline figure: design + implementation; orthographic tier; cue words; prosody (ToBI); turn switches.]

35 The Games Corpus Design, implementation and annotation Agustín Gravano agus@cs.columbia.edu Spoken Language Processing Group Columbia University

