
1 Mitglied der 1 DGD 2.0: A Web-based Navigation Platform for the Visualization, Presentation and Retrieval of German Speech Corpora Joachim Gasch E-mail: gasch@ids-mannheim.de

2
1. Introduction
1.1 The Collection of German Speech Corpora at the IDS
1.2 The Standardization Approach for Cross-Corpus Information Management
2. The Online Navigation Platform
2.1 The Navigation Interface: Design Principles
2.2 The Visualization and Presentation of Speech Corpus Content
2.2.1 Generic Visualization of the XML Meta-Information of Speech Corpora
2.2.2 Transcript Visualization and Presentation
2.2.3 Media Presentation
3. Retrieval Strategies for Unstructured and Structured Speech Corpus Data Components
3.1 The Full-Text Search Module
3.2 XQuery Information Retrieval in Structured XML Documents
4. Summary and Outlook

3
1. Introduction
1.1 The Collection of German Speech Corpora at the IDS
The IDS hosts a wide range of historical and contemporary German speech corpora.
Many historical corpora can be (partially) accessed online via the Database for Spoken German (DGD).
=> Main objectives of the current DGD 2.0 project:
+ Generic, cross-corpus approach to speech corpus management
+ Normalized integration of historical and recent speech corpora
+ Sustainability of speech corpus data components
+ Object-oriented user interface (based on document structures) for corpus exploration and querying

4
1.2 The Standardization Approach for Cross-Corpus Information Management
The speech corpus system manages meta-information about media source signals.
Different corpora: the information structures of the data components may vary considerably because the underlying linguistic research questions differ, e.g. in the genres represented, the degree of content restriction, the physical data structure, and the research field (natural vs. elicited speech).
=> Web-based speech corpus navigation platform:
Standardization concept: a cross-corpus solution for large speech corpus collections rather than for particular speech corpus projects
Definition of a generic, system-wide data model containing the following (systematically interlinked) components:
+ structured XML documentation instances on corpus, event and speaker level
+ unstructured, semi-structured or structured transcripts (time-aligned, multi-dimensional)
+ media source files
+ optional: unstructured secondary documents
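
The interlinked components listed above could be modelled roughly as follows. This is only an illustrative sketch: the class names, field names and the `speakers_of` helper are assumptions made for this example and are not taken from the DGD 2.0 implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transcript:
    path: str
    # One of "unstructured", "semi-structured", "structured" (assumed labels).
    structuring: str


@dataclass
class Speaker:
    speaker_id: str
    documentation_xml: str  # structured XML documentation on speaker level


@dataclass
class Event:
    event_id: str
    documentation_xml: str  # structured XML documentation on event level
    speaker_ids: List[str] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)
    media_files: List[str] = field(default_factory=list)
    secondary_documents: List[str] = field(default_factory=list)  # optional


@dataclass
class Corpus:
    corpus_id: str
    documentation_xml: str  # structured XML documentation on corpus level
    events: Dict[str, Event] = field(default_factory=dict)
    speakers: Dict[str, Speaker] = field(default_factory=dict)

    def speakers_of(self, event_id: str) -> List[Speaker]:
        """Follow the event -> speaker links of the data model."""
        event = self.events[event_id]
        return [self.speakers[s] for s in event.speaker_ids]


# Made-up example data, used only to show how the links are followed.
demo = Corpus("DEMO", "<corpus/>")
demo.speakers["S1"] = Speaker("S1", "<speaker/>")
demo.events["E1"] = Event("E1", "<event/>", speaker_ids=["S1"],
                          transcripts=[Transcript("e1.txt", "unstructured")],
                          media_files=["e1.wav"])
```

Keeping the links as identifiers rather than nested objects means a speaker who takes part in several events is documented once and referenced many times.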

5
Interlinked components of the normalized speech corpus data model

6
2. The Online Navigation Platform
2.1 The Navigation Interface: Design Principles
Object-oriented, document-centric interaction paradigm: based on the document structures to be managed by the system
Provision of adaptive views of speech corpus data components
=> The application menu:
+ Flat structure of the navigation menu
+ Fixed position at the top of the screen
+ Permanent, homogeneous access to application components
+ Flat and hierarchically subdivided menu entry points are each indicated by a distinct symbol

7
=> Classifying icons:
Intuitive user orientation by marking specific types of corpus data components with their corresponding icons
=> Breadcrumb navigation:
Helps users identify their current position in the navigation tree

8
2.2 The Visualization and Presentation of Speech Corpus Content
2.2.1 Generic Visualization of the XML Meta-Information
Native XML database storage of documentation instances
Use of a generic XML rendering module to avoid corpus-specific instance visualizations, providing:
+ expandable / collapsible document nodes
+ node-level selection functionality
+ direct access to hyperlinks
=> The cross-corpus display method (independent of any single corpus) for corpus, event and speaker documentation offers an ergonomic navigation experience (especially for large data-centric XML instances)
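
The idea of a generic renderer with expandable and collapsible nodes can be illustrated with a small sketch. It assumes nothing about the actual DGD 2.0 rendering module: the `render` function, its `max_depth` parameter and the sample element names are invented for this example. The point is only that the renderer walks any XML tree without knowing a corpus-specific schema.

```python
import xml.etree.ElementTree as ET


def render(element, depth=0, max_depth=None):
    """Return indented text lines for an arbitrary XML element.

    Nodes at or below max_depth that have children are shown collapsed ("+");
    expanded nodes with children are marked "-"; leaf nodes get no marker.
    """
    label = element.tag
    attrs = " ".join(f'{k}="{v}"' for k, v in element.attrib.items())
    if attrs:
        label += f" [{attrs}]"
    text = (element.text or "").strip()
    if text:
        label += f": {text}"

    children = list(element)
    collapsed = bool(children) and max_depth is not None and depth >= max_depth
    marker = "+" if collapsed else ("-" if children else " ")

    lines = ["  " * depth + f"{marker} {label}"]
    if not collapsed:
        for child in children:
            lines.extend(render(child, depth + 1, max_depth))
    return lines


# Hypothetical documentation instance (element names are made up).
SAMPLE = ("<event id='E1'><location><city>St. Gallen</city></location>"
          "<date>unknown</date></event>")

if __name__ == "__main__":
    print("\n".join(render(ET.fromstring(SAMPLE), max_depth=1)))
```

Calling `render` with `max_depth=None` expands the whole tree, while a small `max_depth` gives the collapsed overview that is useful for large data-centric instances.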

9
Generic XML document rendering

10
=> Documentation of geocodes:
The geographic coordinates of event locations may be documented in specific speech corpus projects.
A geographic map can be displayed on demand: the example shows the map for the event DH--_E_00167 (latitude 47.423336, longitude 9.377225), which took place in St. Gallen (Switzerland).

11
Geographic map (based on documented geocodes) showing the event location

12
2.2.2 Transcript Visualization and Presentation
For larger speech corpus collections, a common concept of "transcript" becomes fuzzy:
+ Annotation of distinct phenomena
+ Use of heterogeneous (transcript-editor-specific) data formats
Historical speech corpora:
+ Unstructured transcript data formats (layout-oriented only)
Contemporary speech corpora:
+ Use of currently available annotation tools: structured data formats, but no structural homogeneity across corpora
Cross-corpus visualization is possible for the transcript-related part of the event documentation via the menu item Transkripte ("Transcripts"; corpus-specific transcript access lists)

13
Corpus-specific transcript list for the speech corpus DS

14
2.2.3 Media Presentation
Speech corpora may include different types of interdependent media files:
+ One event is related to one or more source files: the raw material recorded for an event (originating directly from an audio device)
+ An event can be composed of several speech events: the source files are further segmented into recordings specific to each speech event
All relevant information about the different media file types is maintained in the meta-documentation of the corresponding event and can be accessed via the list under the menu item Aufnahmen ("Recordings")

15
Corpus-specific list of source recordings for the speech corpus DH

16
3. Retrieval Strategies for Unstructured and Structured Speech Corpus Data Components
Media file content can only be located via descriptive meta-information:
+ metadata (schema-valid XML instances)
+ transcript data (unstructured, semi-structured, structured)
Transcript data in speech corpus collections varies widely in its degree of structuring.
Retrieval strategies depend on this degree: from simple full-text search to complex layer-aware query processing.
Transcript incompatibilities within a single corpus (worst-case scenario):
+ Signal segmentation without precise segmentation guidelines (e.g. phones, words, phrases or turns)
+ No or insufficient naming conventions for the different transcript layer descriptors (e.g. no unique descriptor for the orthographic transcription layer)
+ No exact semantic definition of layers, or semantic mix-up of layer content (e.g. orthographic and phonetic markup mixed in a single layer)
+ No exact syntactic definition of layer content, or syntactic mix-up of layer content (e.g. mixed punctuation or capitalization conventions in the orthographic layer)
+ Violation of cross-layer time relations (e.g. caused by interval changes made with multi-layer transcript editors that lack layer inheritance control)
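
The last incompatibility, violated cross-layer time relations, lends itself to an automatic check. The sketch below assumes a simple hypothetical representation in which each layer is a list of `(start, end, label)` intervals in seconds and every interval of a finer layer (e.g. words) must lie inside some interval of a coarser layer (e.g. turns). The function name, layer layout and sample values are made up for illustration.

```python
def find_time_violations(parent_layer, child_layer, tolerance=0.0):
    """Return child intervals that are not contained in any parent interval.

    Each layer is a list of (start, end, label) tuples, times in seconds.
    """
    violations = []
    for start, end, label in child_layer:
        contained = any(
            p_start - tolerance <= start and end <= p_end + tolerance
            for p_start, p_end, _ in parent_layer
        )
        if not contained:
            violations.append((start, end, label))
    return violations


# Invented example: the word "ja" straddles the boundary between two turns,
# as might happen after moving a turn boundary without adjusting word intervals.
turns = [(0.0, 2.0, "turn1"), (2.0, 5.0, "turn2")]
words = [(0.0, 0.8, "guten"), (0.8, 2.0, "Tag"), (1.9, 2.6, "ja")]

if __name__ == "__main__":
    print(find_time_violations(turns, words))
```

A small `tolerance` can absorb rounding differences between editors; anything larger than that indicates a genuine inconsistency between layers.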

17
3.1 The Full-Text Search Module
No structured data is required (but it can optionally be included).
Advantages: short query response times, easy-to-use interface
The full-text search functionality is implemented using Oracle Text.
Examples of the provided full-text query features:
+ The single-character and multi-character wildcards "_" and "%": _ind matches e.g. "Kind" and "Wind"; %wind matches e.g. "Nordwind" or "Südwind"
+ The operators AND and OR build logical relations between search terms: Nordwind AND Südwind matches only documents with occurrences of both terms
+ The NOT operator excludes a specific search term: Nordwind NOT Südwind matches only documents containing "Nordwind" but not "Südwind"
+ The NEAR operator finds documents depending on the word distance of search terms: NEAR((Schule, Kirche), 4, true) matches documents where both search terms occur with a (maximum) word distance of 4 words
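
The behaviour of the wildcard and NEAR examples can be mimicked in a few lines of Python. This is not Oracle Text and does not reproduce its exact semantics; it is a simplified stand-in under stated assumptions: "_" stands for exactly one character, "%" for any run of characters, and the distance is taken as the difference between token positions. The function names are invented, and the `ordered` flag is assumed to correspond loosely to the third argument of the NEAR example.

```python
import re


def wildcard_to_regex(pattern):
    """Translate "_" (one character) and "%" (any run) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")
        elif ch == "%":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_match(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


def near(tokens, first, second, max_distance, ordered=False):
    """True if both terms occur within max_distance token positions.

    With ordered=True, `first` must occur before `second`.
    """
    first_pos = [i for i, t in enumerate(tokens) if t == first]
    second_pos = [i for i, t in enumerate(tokens) if t == second]
    for i in first_pos:
        for j in second_pos:
            if ordered and j < i:
                continue
            if i != j and abs(j - i) <= max_distance:
                return True
    return False


# Invented sample sentence for the NEAR illustration.
tokens = "die Schule steht neben der Kirche".split()

if __name__ == "__main__":
    print(wildcard_match("_ind", "Kind"), wildcard_match("%wind", "Nordwind"))
    print(near(tokens, "Schule", "Kirche", 4, ordered=True))
```

A production full-text engine works on an inverted index rather than scanning tokens, which is where the short response times mentioned on the slide come from.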

18
Full-text search in semi-structured transcript data with search results (KWIC list)

19
3.2 XQuery Information Retrieval in Structured XML Documents
The full-text search option is not sufficient for retrieval in fine-grained XML instances (such as metadata or time-aligned multi-dimensional transcripts).
XQuery allows the implementation of context-sensitive queries on the hierarchical, interdependent informational units of XML-structured data:
+ criteria-specific information selection and filtering
+ joining of data from document selections
+ sorting, grouping, aggregating, transforming and restructuring of data
+ arithmetic calculations on numbers and dates
Powerful queries can be defined, but detailed knowledge of the underlying information structures is necessary.
=> Two different approaches to implementing Web-based XQuery retrieval interfaces:
+ HTML form with a graphical representation of the XML tree (easy to use, but limited flexibility for query definition)
+ HTML form providing a text area in which the XQuery is entered as plain text (intended for system experts only; complex queries on data-centric instances and cross-structural joins are also possible)
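
The difference between a full-text hit and a context-sensitive query can be shown with a small Python stand-in. It is not XQuery and not the DGD 2.0 query interface; it uses the limited XPath support of Python's standard library to illustrate selection, filtering, sorting and aggregation on structured metadata. The element and attribute names and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical event documentation (structure and values are made up).
SAMPLE = """<corpus id="X">
  <event id="E1"><location>Mannheim</location>
    <speaker id="S1" role="interviewer"/><speaker id="S2" role="informant"/></event>
  <event id="E2"><location>St. Gallen</location>
    <speaker id="S3" role="informant"/></event>
  <event id="E3"><location>Mannheim</location>
    <speaker id="S2" role="informant"/></event>
</corpus>"""


def events_at(root, location):
    """Select events by the content of their <location> child, sorted by id."""
    hits = [e for e in root.findall("event")
            if e.findtext("location") == location]
    return sorted(e.get("id") for e in hits)


def speakers_per_event(root, role):
    """Aggregate: number of speakers with the given role for each event."""
    return {e.get("id"): len(e.findall(f"speaker[@role='{role}']"))
            for e in root.findall("event")}


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    # A plain substring search only says that "Mannheim" occurs somewhere;
    # the structured query says in which events it is the documented location.
    print("Mannheim" in SAMPLE)
    print(events_at(root, "Mannheim"))
    print(speakers_per_event(root, "informant"))
```

As the slide notes, such queries presuppose knowledge of the underlying structure: both helper functions only work because the element names `event`, `location` and `speaker` are known in advance.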

20
HTML form providing a graphical XQuery composition interface

21
HTML form for XQuery plain text submission

22
4. Summary and Outlook
Media source files become analyzable via their associated meta-information.
Contemporary speech corpus systems have to close the gap between the processing of binary media data and the related meta-information.
The need for standardization of speech corpus components is commonly accepted.
But: identifying all parameters necessary for a cross-corpus standardization remains an open goal.
Evolving technologies such as the MPEG-7 standard might provide appropriate logic to achieve the standardized integration of the different audiovisual information types (potentially involved in media corpora):
+ Audio
+ Voice
+ Video
+ Images
+ Graphs
+ 3D models
=> Questions? Suggestions?

