DASISH Digital Services Infrastructure for Social Sciences and Humanities Daan Broeder TLA - MPI for Psycholinguistics / DASISH & CLARIN EGI Forum Garching,


1 DASISH Digital Services Infrastructure for Social Sciences and Humanities Daan Broeder TLA - MPI for Psycholinguistics / DASISH & CLARIN EGI Forum Garching, March 29 2012

2 DASISH Origin
FP7 Capacities Work Programme: Infrastructures
INFRA-2011-2.3.1: Implementation of common solutions for a cluster of ESFRI infrastructures in the field of "Social Sciences and Humanities".
A project under this topic should implement harmonised solutions for the ESFRI infrastructures in the field of Social Sciences and Humanities on issues such as metadata frameworks, registries, single sign-on systems and persistent identifiers.

3 DASISH consortium I
18 partners from 10 countries (EU + Norway)
5 ESFRI infrastructures: CESSDA, CLARIN, DARIAH, ESS and SHARE
DASISH budget: 6 M€ -> 700 PMs
Started January 2012
Duration: 36 months
Builds on already existing collaborations
– Partners are part of multiple research infrastructures
– ESFRI projects merging (proposals), e.g. CLARIN + DARIAH -> CLARIAH

4 Council of European Social Science Data Archives
An umbrella organisation for social science data archives across Europe. Since the 1970s the members have worked together to improve access to data. CESSDA research and development projects and Expert Seminars enhance the exchange of data and technologies among data organisations. 20 CESSDA member organisations serve some 30,000+ social science and humanities researchers and students each year.
- Documents, recordings, statistical data & surveys: demographics, health, economy, education, politics, …
- Developers of the DDI metadata standard
- Multiple data centers

5 Common Language Resources and Technology Infrastructure
CLARIN, with 193 member institutions,
– is a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable
– is committed to establishing an integrated and interoperable research infrastructure of language resources and technology
– aims at overcoming the current fragmentation, offering a stable, persistent, accessible and extendable infrastructure and thereby enabling eHumanities
- Language Resources: text & multi-media corpora and lexica, …
- Language Technology: parsers, tokenizers, speech recognizers, …
- Multiple CLARIN Centers (±30)
ERIC status since Feb 2012

6 Digital Research Infrastructure for the Arts and Humanities
The mission of DARIAH is to enhance and support digitally-enabled research across the humanities and arts. DARIAH aims to develop and maintain an infrastructure in support of ICT-based research practices. It has 14 partners and 5 associate partners.
DARIAH is working with communities of practice to:
– Explore and apply ICT-based methods and tools to enable new research questions to be asked and old questions to be posed in new ways
– Improve research opportunities and outcomes through linking distributed digital source materials of many kinds
– Exchange knowledge, expertise, methodologies and practices across domains and disciplines
- Wide variety of data types for all the SSH disciplines
- Services: statistics, visualization (maps), NLP, …
- Virtual Competence Centres

7 The European Social Survey
An academically-driven social survey designed to chart and explain the interaction between Europe's changing institutions and the attitudes, beliefs and behaviour patterns of its diverse populations.
- Survey oriented
- Single data centre, multiple competence centres

8 The Survey of Health, Ageing and Retirement in Europe
A multidisciplinary and cross-national panel database of micro data on health, socio-economic status and social and family networks of more than 45,000 individuals aged 50 or over. The survey’s third wave of data collection, SHARELIFE, collected detailed retrospective life-histories in thirteen countries in 2008-09.
- Survey oriented
- Single data center
ERIC status since March 2011

9 Consortium II
Partners: SND, DANS, UEssex, FSD, NSD, GESIS, CITY, KCL, UGOE, OEAW, MPIPL, UCPH, UIB, UPF, NUIM, MPISOC, UNIVE, CentERdata, UT
Infrastructures: CESSDA, DARIAH, CLARIN, ESS, SHARE

10 Management

11 DASISH Work packages
WP1 (lead: UGOT, 44 PM)
WP2 Architecture and Quality assessment (lead: DANS, 83 PM)
– Liaise with other e-infra initiatives
– Get requirements for a ref. architecture
– Assess results
WP3 Data Quality (lead: NSD, 200 PM)
– Improve EU-wide survey quality: terminology, translation, vocabulary normalizations
WP4 Archiving (lead: NSD, 67 PM)
– State of preservation in SSH
– Assessment of deposit services: recommendations, negotiations
– Deposit service convergence
WP5 Data Access and Enrichment (lead: MPI-PL, 171 PM)
– Federated Identity
– PID requirements
– Metadata quality improvement
– Joint metadata domain
– Workflow use cases
– Annotation framework
WP6 Legal and Ethical Issues (lead: MPI-SOC, 68 PM)
– Identify legal and ethical issues wrt. current and new SSH data types resulting from the integration, linking and archiving
– Legal & ethical VCC
WP7 Education and Training (lead: UGOE, 56 PM)
– Training modules, workshop program
WP8 Dissemination (lead: UCPH, 34 PM)
– Communication strategy and means

12 DASISH Mission
DASISH provides and/or brokers solutions for a number of common issues of the five ESFRI projects in social sciences and humanities.
DASISH identifies four major areas:
– data quality (surveys): ESS & SHARE
– data archiving
– data access
– legal and ethical issues
General procedure: Inventory, Analysis, Brokering, Implementation, Education, Outreach

13 DASISH Challenges
Need to create a common infrastructure, not just strengthen community-specific ones
Traditions vary considerably
– Between the social sciences on one side and the humanities on the other, but also within the humanities
– Some collaborations/communities have a rich history
– Others are fairly new (as an infrastructure)
– Organizational models and complexity vary and affect preferences for solutions
– Differences in understanding of IT issues
Language and terminology vary even more, as past discussions have taught us

14 Data Quality
Highly domain specific
Can be shared by communities such as CESSDA, SHARE and ESS, but probably not outside DASISH
DASISH WP3 plans
– Questionnaire Design Documentation Databank
– Translation Tool and Databank
– Question Databank
– (Survey) Fieldwork monitoring system

15 Data Archiving
Is a generic service possible?
Archiving policies differ, and probably necessarily so
– Allowed archivable formats, retention time?, required reliability, security assurances, …
Any common system should accommodate multiple policies
Existing solutions: national data centers
– But are they available to all?
– Are they sufficiently flexible?
The EUDAT data management infrastructure is in progress and can be (part of) a generic solution.
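The requirement that a common system accommodate multiple policies can be sketched as follows. This is only an illustration: the `ArchivingPolicy` fields, the policy names and all values are assumptions made for the example, not DASISH or EUDAT definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivingPolicy:
    """Hypothetical per-community policy; field names are illustrative only."""
    name: str
    allowed_formats: frozenset
    retention_years: int
    min_copies: int


def check_deposit(policy, file_format, copies):
    """Return a list of policy violations (an empty list means acceptable)."""
    problems = []
    if file_format.lower() not in policy.allowed_formats:
        problems.append(f"format '{file_format}' not allowed by {policy.name}")
    if copies < policy.min_copies:
        problems.append(
            f"{copies} copies, but {policy.name} requires {policy.min_copies}"
        )
    return problems


def retention_end(policy, deposit_year):
    """Year until which the deposit must be kept under this policy."""
    return deposit_year + policy.retention_years


# Two made-up policies evaluated by the same generic service code.
survey_policy = ArchivingPolicy("survey-archive", frozenset({"csv", "xml"}), 25, 2)
media_policy = ArchivingPolicy("media-archive", frozenset({"wav", "xml"}), 50, 3)

print(check_deposit(survey_policy, "csv", 2))
print(check_deposit(media_policy, "csv", 2))
print(retention_end(media_policy, 2012))
```

The point of the design is that the service logic stays generic while each community supplies its own policy object.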

16 Access - Data Discovery
Metadata quality improvement
– Controlled set of schemas -> schema registry
– Controlled vocabularies -> vocabulary services
– Explicit semantics for schemas -> semantic/concept registries
Single metadata catalogue
– Metadata interoperability -> semantic/concept registry
– Well-defined metadata harvesting infrastructure -> OAI-PMH + metadata provider registry
– Granularity is an issue!
Solutions
– CESSDA and CLARIN already have catalogs; others also exist
– CLARIN claims a general framework for MD interoperability
– EUDAT is working on a shared catalogue
– Technology is available, e.g. the US DataONE project using Mercury
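As an illustration of the harvesting step, the sketch below builds an OAI-PMH `ListRecords` request and parses a response in the `oai_dc` (Dublin Core) format. The endpoint `https://example.org/oai`, the sample record and the helper names are assumptions for the example; a real harvester would fetch the URL over HTTP and loop until no resumption token is returned.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 protocol and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("./oai:ListRecords/oai:record", NS):
        identifier = rec.findtext("./oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    token = root.findtext("./oai:ListRecords/oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Hand-written sample response (hypothetical provider and record).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:rec1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample survey</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai"))
print(parse_list_records(SAMPLE))
```

A metadata provider registry, as named on the slide, would supply the list of base URLs that such a harvester iterates over.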

17 Access - AAI
Goal: single sign-on and single user identity
– Many users -> maintaining a separate user store is prohibitive
– Limited complexity for the user, e.g. no certificate handling by users
Solutions
Federated Identity Management offered by the national identity federations (IDFs) and use of SAML2 should be sufficient, provided that
– EU GEANT/eduGAIN inter-federation works
– A proper set of user attributes is released
There are legal and practical issues concerning the attribute release policies of the user’s institute
Current attribute release policies don’t scale well
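Why attribute release matters can be shown with a small sketch of the check a service might run after a federated login. The attribute names come from the eduPerson schema, but the required set and the example users are assumptions for illustration; no SAML library is involved here.

```python
# Attributes this hypothetical service needs to identify and authorize a user.
REQUIRED_ATTRIBUTES = {"eduPersonPrincipalName", "eduPersonScopedAffiliation"}


def missing_attributes(released, required=REQUIRED_ATTRIBUTES):
    """Return the required attribute names the identity provider did not release."""
    return sorted(set(required) - set(released))


def can_authorize(released):
    """A login is only usable if every required attribute was released."""
    return not missing_attributes(released)


# Attributes as they might arrive from two different home institutes.
full_release = {
    "eduPersonPrincipalName": "jdoe@example.org",
    "eduPersonScopedAffiliation": "staff@example.org",
}
minimal_release = {"eduPersonScopedAffiliation": "member@example.net"}

print(can_authorize(full_release))
print(missing_attributes(minimal_release))
```

With many identity providers and many services, each pair needing its own release agreement, this check fails often in practice, which is the scaling problem the slide refers to.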

18 Access - PIDs
Persistent Identifiers for data
– Could be generic, but tradition splits SSH institutes between those using URNs and those using Handle/DOI
– Support for identifying parts of objects
– Added functionality in cooperation with other e-infra projects such as EUDAT
– Important extra functionality can be associated with the PID framework, e.g. a checksum for verifiability
Solutions: EPIC, DataCite (hdl), Persid (URN)
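The idea of attaching a checksum to a PID record can be sketched with a toy in-memory registry. This does not use the EPIC, DataCite or Persid interfaces; the class, the PID string and the URL are hypothetical and serve only to show how a stored checksum makes a resolved object verifiable.

```python
import hashlib


class PidRegistry:
    """Toy in-memory PID registry storing a location and a SHA-256 checksum."""

    def __init__(self):
        self._records = {}

    def register(self, pid, url, data):
        """Store the object's location and the checksum of its current bytes."""
        self._records[pid] = {
            "url": url,
            "sha256": hashlib.sha256(data).hexdigest(),
        }

    def resolve(self, pid):
        """Return the current location of the object."""
        return self._records[pid]["url"]

    def verify(self, pid, data):
        """Check that retrieved bytes match the checksum stored with the PID."""
        return hashlib.sha256(data).hexdigest() == self._records[pid]["sha256"]


registry = PidRegistry()
original = b"interview transcript, version 1"
registry.register("hdl:00000/example-1", "https://example.org/data/1", original)

print(registry.resolve("hdl:00000/example-1"))
print(registry.verify("hdl:00000/example-1", original))
print(registry.verify("hdl:00000/example-1", b"tampered content"))
```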

19 Access – linking data
Annotation framework
– Create relations between (parts of) on-line data objects
– Need special data-type specific visualization when linking to parts of data objects
– Need a registry to store and access these relations
Solutions
– Some exist, but they only allow linking to complete objects via URLs, or else marking up a browser screen dump, e.g. “Evernote”
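A minimal sketch of such a relation registry is given below. The data model (`Target`, `Relation`, `RelationRegistry`), the fragment syntax and the PIDs are all assumptions made for the example, not the DASISH annotation framework design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A whole object (empty fragment) or a part of it, addressed via its PID."""
    pid: str
    fragment: str = ""  # e.g. a time range in a recording or a line range in a text


@dataclass(frozen=True)
class Relation:
    source: Target
    target: Target
    label: str


class RelationRegistry:
    """Toy registry that stores relations and finds those touching an object."""

    def __init__(self):
        self._relations = []

    def add(self, relation):
        self._relations.append(relation)

    def about(self, pid):
        """All relations whose source or target lies in the given object."""
        return [
            r for r in self._relations
            if pid in (r.source.pid, r.target.pid)
        ]


registry = RelationRegistry()
registry.add(Relation(
    source=Target("hdl:00000/recording-7", "t=12.5,18.0"),
    target=Target("hdl:00000/transcript-7", "line=40,44"),
    label="transcribed-by",
))

for relation in registry.about("hdl:00000/recording-7"):
    print(relation.label, relation.source.fragment, relation.target.fragment)
```

Keeping the fragment separate from the PID is one way to link to parts of objects while still resolving the object itself through the PID system.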

20 Computing
Only a few applications in SSH need intensive computing
– This will change in the future as more automatic feature extraction is done on media recordings
Some SSH ESFRI projects are developing SOAs with complex workflows to process data
Facilities for cheap, flexible deployment of services are requested
– An organizational & management problem
– Not a computational resources problem

21 DASISH Context
Communities: CLARIN, DARIAH, CESSDA, LifeWatch
– Community specific: CLARIN LT web service infrastructure
– SSH communities wide (DASISH): common SSH metadata catalog
– PID services: EPIC / Persid
– Federated Identity Management
– Data preservation: EUDAT replication & preservation
– Network services: GEANT

22 Thank you for your attention

23 Annotation Framework (diagram: item1, itemZ)

24 Possible use cases I
These have not yet been fleshed out. Some ideas from the CLARIN side:
Social scientists have recordings that are of interest to linguists.
– Locate these using appropriate metadata and process them with LT tools to analyze gesture
– The analysis results should be (after evaluation) deposited into an archive with proper references to the primary data
– The analysis data should again be registered with proper metadata for reuse
Use of demographic data for corpus building and use
– Give a linguist building a balanced speech corpus access to demographic data
– How many of the speakers need to be older than 65 for the corpus to be a representative sample?
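The last question above amounts to proportional allocation. A minimal sketch follows, assuming the linguist has obtained population shares per age group from a demographic data set; the shares and corpus size used here are invented for illustration and are not real statistics.

```python
def allocate_speakers(total, percent_by_group):
    """Split `total` speakers over groups in proportion to integer percentages.

    Uses the largest-remainder method so that the counts add up to `total`.
    """
    if sum(percent_by_group.values()) != 100:
        raise ValueError("percentages must add up to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {g: total * p // 100 for g, p in percent_by_group.items()}
    remainders = {g: total * p % 100 for g, p in percent_by_group.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining seats to the groups with the largest remainders.
    for group in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[group] += 1
    return counts


# Hypothetical population shares (percent) for three age groups.
shares = {"under_25": 27, "25_to_64": 55, "65_plus": 18}
corpus_plan = allocate_speakers(150, shares)
print(corpus_plan)
```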

25 Possible use cases II
Combine maps of linguistic dialects or linguistic micro-variation with migration statistics.
– Looking at variation both in place and time should be interesting
Make metadata on medieval texts & literature available on the web and interlink it with manuscripts and transcripts available from cultural heritage institutions
– Have the possibility to add enhancements and comments from the research community
Give historians of science and ideas access to language technology to analyze historical texts
– Allows following the appearance and spread of new concepts and inventions
– The Dutch CCKC project used this to analyze the correspondence between scientists in the 17th century

