DATA INTEGRATION FOR LANGUAGE DOCUMENTATION


1 DATA INTEGRATION FOR LANGUAGE DOCUMENTATION
Under the guidance of: Dr. Jan Chomicki & Dr. Jeff Good
Presented by: Sumit Agrawal

2 INTRODUCTION
This project aims to integrate a large amount of data spread across many files and folders and stored in different formats. The data covers 7-9 languages documented by a linguistics project under way in Cameroon. The data also contains metadata about the files.

4 DATA FORMATS
Data available in different formats:
- Audiovisual: audio recordings, video recordings, photographs, scanned images
- Textual: transcriptions (some time-aligned, XML), unstructured text (various formats), questionnaire data, lexical data (e.g., vocabulary items in a database)
- Metadata

5 CHALLENGES
- Every file should have metadata, but some files have no associated metadata.
- Each researcher uses a different format when writing files.
- Different researchers sometimes worked with the same people.
- More than 200 different file types.

6 AIM
A system that can query the data by:
- Author name
- Speaker name
- Date, language name, etc.
E.g. records pertaining to the language ‘Naki’, or all the records of the date ‘ ’.
Clean the data, remove duplicates and build a database.
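The kind of lookup described in this aim can be sketched in Python. This is only an illustration: the field names (`language`, `author`, `speaker`) and the sample records are assumptions made for the example, not the project's actual schema.

```python
def query(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        rec for rec in records
        if all(rec.get(field) == value for field, value in criteria.items())
    ]

# Hypothetical records; the field names are assumed for illustration.
records = [
    {"language": "Naki", "author": "Jeff Good", "speaker": "George Ngong"},
    {"language": "Other", "author": "Jeff Good", "speaker": "Someone Else"},
]

naki = query(records, language="Naki")          # records for one language
by_author = query(records, author="Jeff Good")  # records by one author
```

Several criteria can be combined in one call, in which case a record must match all of them.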

7 AIM
- Link each file to its metadata.
- Query the RDF data using SPARQL.
- Integrate the database with the file system.
- Develop a user interface for queries.
- Measure the density of the data.
- Database management.

8 ORIGINAL DATA- FOLDERS

9 ORIGINAL DATA- FILES

10 PARSING
The files were parsed using Python scripts.
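The scripts themselves are not shown in the slides. A minimal sketch of what such a parsing pass might look like, assuming it only needs to walk a folder tree and record each file's path, name and extension (the function name `scan_tree` and the record keys are hypothetical):

```python
import os
from pathlib import PurePath

def scan_tree(root):
    """Walk `root` and return one record per file: path, name, extension."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Normalise the extension: lower-case, without the leading dot.
            ext = PurePath(name).suffix.lower().lstrip(".")
            records.append({
                "path": os.path.join(dirpath, name),
                "name": name,
                # Files with no extension are labelled "unknown" (an assumption).
                "extension": ext or "unknown",
            })
    return records
```

Calling `scan_tree` on the top-level backup folder would yield a flat list of records that later steps can group and link.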

11 INITIAL RESULT

12 CLEANING & LINKING
- The different data formats were identified.
- The identified files were grouped by file extension.
- The related metadata for each file (e.g. language, date and extension) was extracted.
- Duplicate files were identified.
- The unidentified files were grouped in a separate file.
- The identified files were linked to the existing metadata.
- There are two types of metadata: the metadata we extracted and the metadata that was provided.
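The cleaning and linking steps can be sketched as follows. The slides do not say how duplicates were detected or how files were matched to the provided metadata, so this sketch assumes content hashing for duplicates and matching on file name for linking; all function names and record keys are hypothetical.

```python
import hashlib
from collections import defaultdict

def group_by_extension(records):
    """Group file records by their 'extension' field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["extension"]].append(rec)
    return dict(groups)

def find_duplicates(contents_by_path):
    """Given a mapping path -> bytes, return lists of paths with identical content."""
    by_digest = defaultdict(list)
    for path, data in contents_by_path.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]

def link_metadata(records, provided):
    """Merge provided metadata (keyed by file name, an assumption) into
    the extracted records; records with no match are returned separately."""
    linked, unlinked = [], []
    for rec in records:
        meta = provided.get(rec["name"])
        if meta is None:
            unlinked.append(rec)
        else:
            linked.append({**rec, **meta})
    return linked, unlinked
```

Hashing file contents catches copies that were renamed or moved, at the cost of reading every file; comparing names and sizes would be cheaper but less reliable.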

13 AFTER CLEANING - RESULT
A sample of the data constructed after cleaning and linking the data with metadata:
Naki wav F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 Naki JCG.wav George Ngong NAKI-NOTEBOOK Jeff Good
Naki wav F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 Naki JCG.wav George Ngong NAKI-NOTEBOOK :26 Jeff Good
Naki wav F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 Naki JCG.wav George Ngong NAKI-NOTEBOOK :78 Jeff Good
Naki wav F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 Naki JCG.wav George Ngong NAKI-NOTEBOOK :914 Jeff Good
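The folder path in these records appears to encode the language and the researcher. A minimal sketch of recovering both from the path, assuming the layout is `<backup root>\<language>\<researcher>\...` (this layout is inferred from the sample above, and the function name is hypothetical):

```python
from pathlib import PureWindowsPath

def metadata_from_path(folder, backup_root):
    """Derive language and researcher from a folder path (assumed layout)."""
    rel = PureWindowsPath(folder).relative_to(PureWindowsPath(backup_root))
    language, researcher = rel.parts[0], rel.parts[1]
    # Folder names use underscores where the person's name has spaces.
    return {"language": language, "researcher": researcher.replace("_", " ")}

meta = metadata_from_path(
    r"F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005",
    r"F:\DataIntegration\GoodBackup1-Obang",
)
```

`PureWindowsPath` parses Windows-style paths on any operating system, so the sketch does not need to run on the machine that holds the backup.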

14 XML SCHEMA

15 RDF DATA

16 BUILD AN RDF DATABASE USING SESAME
TRIPLES OF THE RDF MODEL
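To illustrate what triples of such a model look like, the sketch below turns one metadata record into N-Triples lines using plain Python. It does not use Sesame, and the namespace `http://example.org/langdoc/`, the property names and the function names are assumptions made for the example, not the project's vocabulary.

```python
BASE = "http://example.org/langdoc/"  # hypothetical namespace

def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def record_to_triples(file_id, record):
    """Emit one N-Triples line per field: <file> <property> "value" ."""
    subject = f"<{BASE}file/{file_id}>"
    return [
        f'{subject} <{BASE}{key}> "{escape_literal(str(value))}" .'
        for key, value in sorted(record.items())
    ]

triples = record_to_triples("1", {"language": "Naki", "speaker": "George Ngong"})
```

Each line has a subject (the file), a predicate (the metadata field) and an object (the value), which is the shape an RDF store loads and a SPARQL query matches against.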

17 RDF GRAPH

18 CURRENT GOALS
- Provide SPARQL querying ability for the RDF data.
- Link the remaining metadata to the parsed metadata.
- Build a database for the unidentified files.

19 LONG-TERM GOALS
- Create a multimedia server to store all the data along with its metadata and the RDF data.
- Automate the dumping of data into the repository.
- Build a user interface.
- Provide Linked Data for the Semantic Web.

20 THANK YOU!

22 REFERENCES http://www.w3.org/TR/rdf-schema/