Organising social science data – computer science perspectives Simon Jones Computing Science and Mathematics University of Stirling, Stirling, Scotland,


1 Organising social science data – computer science perspectives
Simon Jones
Computing Science and Mathematics, University of Stirling, Stirling, Scotland, UK
Seminar: Data management in the social sciences and the contribution of the DAMES Node, Stirling, 31 January 2012
DAMES: Data Management through e-Social Science
http://www.dames.org.uk

2 DAMES: Background
- DAMES: Case studies, provision and support for data management in the social sciences
- This talk: focusing on "support for data management"
- Infrastructure/tools
- Driven by social science needs for support for advanced data management operations
- “In practice, social researchers often spend more time on data management than any other part of the research process” (Lambert)
- A ‘methodology’ of data management is relevant to ‘harmonisation’, ‘comparability’, ‘reproducibility’ in quantitative social science

3 DAMES: Themes
- Enabling the (social science) researcher:
  - To deposit, search and process heterogeneous data resources
  - To access online services/‘tools’ that enable researchers to carry out repeatable and challenging data management techniques such as fusion, matching, imputation, …
- Facilitating access is an important goal
- Underlying computer science research themes
  - Metadata
  - Data curation
  - Data management/processing
  - Portals

4 Data management/processing scenarios
- Curation scenarios include:
  - Uploading occupational data to distribute across the academic community
  - Recording data properties prior to undertaking data fusion involving a survey and an aggregate dataset
- Fusion scenarios include:
  - Linking a micro-social survey with aggregate occupational information (deterministic link)
  - Enhancing a survey dataset with ‘nearest match’ explanatory variables (probabilistic link)
- Other processes: recoding, operationalising, linking, cleaning…
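The deterministic link in the first fusion scenario can be illustrated with a minimal Python sketch. It assumes that each survey record carries an occupation code and that the aggregate occupational information is keyed by the same code. The field names (`occ_code`, `status_score`), the codes and the values are all invented for illustration and are not taken from DAMES.

```python
# Hypothetical survey micro-data: one record per respondent.
survey = [
    {"id": 1, "occ_code": "2311", "age": 34},
    {"id": 2, "occ_code": "5223", "age": 51},
    {"id": 3, "occ_code": "9999", "age": 28},  # code absent from the aggregate table
]

# Hypothetical aggregate occupational information, keyed by occupation code.
occupation_info = {
    "2311": {"status_score": 70.1},
    "5223": {"status_score": 42.5},
}


def deterministic_link(records, lookup, key):
    """Attach aggregate fields to each record by exact match on `key`."""
    linked = []
    for record in records:
        extra = lookup.get(record[key])
        merged = dict(record)                  # leave the input record untouched
        merged["matched"] = extra is not None  # flag records with no aggregate row
        if extra is not None:
            merged.update(extra)
        linked.append(merged)
    return linked


linked = deterministic_link(survey, occupation_info, "occ_code")
```

Unmatched records are kept and flagged rather than dropped, so the share of unlinked cases stays visible to whoever documents the derived dataset.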

5 Generic data flows
[Diagram labels: Data set store, Processing]
- Data sets are deposited
- Data sets are selected
- Processing is configured
- Result is saved
Data set selection and the configuration of processing jobs must be informed by knowledge about the data sets: metadata

6 Key role for metadata
- Metadata records are absolutely core to the functioning of the portal infrastructure
  - For adequate, searchable records for the heterogeneous resources (data tables, command files, notes and documentation)
  - To connect the resources and the data management tools
  - To document the data sets resulting from application of the data management tools: inputs, process, rationale, …
- DAMES requirements:
  - (Micro-)data based, very general
  - DDI (= Data Documentation Initiative)

7 DDI 2 – An XML language
[XML extract; most of the markup was lost in extraction. Surviving text and markup:]
An interesting study 12 DAMES Portal Univ of Stirling July 29, 2010
<ddi2:grantNo source="Financial_1" agency="Economic and Social Research Council"> RES-149-25-1066...
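As a rough illustration of how a tool might read a field from such a record, the sketch below parses the one element that survives on the slide using Python's standard `xml.etree.ElementTree`. The wrapper element `record` and the namespace URI `urn:example:ddi2` are placeholders invented for this sketch. A real DDI 2 document has its own root element and namespace, which the transcript does not show.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URI, invented so the ddi2 prefix resolves.
NS = {"ddi2": "urn:example:ddi2"}

# Only the grantNo element comes from the slide; <record> is a hypothetical wrapper.
fragment = """
<record xmlns:ddi2="urn:example:ddi2">
  <ddi2:grantNo source="Financial_1" agency="Economic and Social Research Council">
    RES-149-25-1066
  </ddi2:grantNo>
</record>
"""

root = ET.fromstring(fragment)
grant = root.find("ddi2:grantNo", NS)

# Collect the attributes and the element text into a plain dictionary.
grant_info = {
    "agency": grant.get("agency"),
    "source": grant.get("source"),
    "number": grant.text.strip(),
}
```

The same pattern of a prefix map plus `find` extends to any other element a search or configuration tool needs to read.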

8 The metadata "cycle"
[Diagram labels: Processing, Metadata, Search, Configure/process, Select, Deposit/curate]
Data is mirrored by metadata

9 DAMES portal architecture overview
[Diagram labels: Portal, DAMES Resources, External Dataset Repositories, User, Services, Search, Enact Fusion, File Access, Compute Resources, Metadata, Local Datasets]
(Note: Security omitted)

10 Tools
- Since metadata must have a key role in data management…
- …tools for managing and exploiting the metadata have a key role in the use and operation of the DAMES portal
  - At deposit/curation
  - For searching
  - For informing the configuration of processing steps
- The following slides illustrate use of our tools

11 Curation Tool
The source data:

12–23 [Slides with no transcribed text]

24 Also automatically uploaded to searchable eXist database

25 Metadata searching

26 Browsing the search results

27 Fusion Tool prototype
- Scenario: A social science researcher wishes to fuse Scottish Household Survey (SHS) data with privately collected study data:
  - Uses the data curation tool to upload the data
  - Uses the data fusion/imputation tool to select the data, identify corresponding variables, and generate a derived dataset (held in the portal)
  - The metadata about this derived dataset is stored and (may be) made public through the portal
- Another researcher can now search the portal (metadata) for SHS data and find the derived dataset
- DAMES metadata handling must facilitate this process
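The scenario can be sketched in Python as a toy in-memory metadata store. This is only an illustration of the idea that a derived dataset becomes discoverable through its metadata. The class names, fields and dataset identifiers are hypothetical, and the actual portal keeps DDI records in an eXist database rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical, heavily simplified stand-in for a DDI record."""
    dataset_id: str
    title: str
    keywords: set
    derived_from: list = field(default_factory=list)  # provenance: source dataset ids
    public: bool = False


class MetadataStore:
    def __init__(self):
        self._records = {}

    def deposit(self, record):
        self._records[record.dataset_id] = record

    def search(self, keyword):
        """Return public records tagged with `keyword` (case-insensitive)."""
        wanted = keyword.lower()
        return [
            r for r in self._records.values()
            if r.public and wanted in {k.lower() for k in r.keywords}
        ]


store = MetadataStore()
store.deposit(MetadataRecord("shs-extract", "SHS extract", {"SHS"}, public=True))
store.deposit(MetadataRecord("private-study", "Private study data", {"study"}))
store.deposit(MetadataRecord(
    "fused-1", "SHS fused with private study data", {"SHS", "fusion"},
    derived_from=["shs-extract", "private-study"], public=True,
))

# A second researcher searching for SHS data also finds the derived dataset.
hits = [r.dataset_id for r in store.search("shs")]
```

Recording `derived_from` alongside the keywords is what lets the second researcher see not only that the fused dataset exists but also which inputs it came from.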

28 The Fusion Tool prototype
- Select datasets (recipient and donor)
- Select "common variables"
- Select variables to be imputed
- Select data fusion method
- Submit to fusion "enactor"
[Annotation: Metadata accessed]

29
- Select datasets (recipient and donor)
- Select "common variables"
- Select variables to be imputed
- Select data fusion method
- Submit to fusion "enactor"
[Annotation: Metadata accessed]

30
- Select datasets (recipient and donor)
- Select "common variables"
- Select variables to be imputed
- Select data fusion method
- Submit to fusion "enactor"
[Annotations: Skipped; Metadata for result dataset]
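The wizard steps map naturally onto the arguments of a fusion routine: a recipient dataset, a donor dataset, the common variables used for matching, and the variables to be imputed. The slides do not say which fusion methods the tool offers, so the Python sketch below substitutes one simple technique: nearest-neighbour donor matching on unscaled squared Euclidean distance. All variable names and values are invented for illustration.

```python
def fuse(recipient, donor, common_vars, impute_vars):
    """Copy `impute_vars` from the closest donor record into each recipient record.

    Closeness is the squared Euclidean distance over `common_vars`.
    Assumption: the common variables are numeric and comparably scaled.
    """
    def distance(a, b):
        return sum((a[v] - b[v]) ** 2 for v in common_vars)

    fused = []
    for rec in recipient:
        nearest = min(donor, key=lambda d: distance(rec, d))
        row = dict(rec)                  # keep the recipient's own variables
        for var in impute_vars:
            row[var] = nearest[var]      # imputed from the matched donor
        fused.append(row)
    return fused


# Hypothetical recipient (e.g. survey) and donor (e.g. private study) records.
recipient = [{"age": 30, "hh_size": 2}, {"age": 62, "hh_size": 1}]
donor = [
    {"age": 28, "hh_size": 2, "income_band": 3},
    {"age": 65, "hh_size": 1, "income_band": 2},
    {"age": 45, "hh_size": 4, "income_band": 4},
]

fused = fuse(recipient, donor, ["age", "hh_size"], ["income_band"])
```

In practice the common variables would usually be standardised first, since otherwise a wide-ranging variable such as age dominates the distance.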

31 Job submission: Information flow
[Diagram labels: Wizard; Enactor; Compute resources (Condor); subjob1; subjob2; User's local file store; Resultant data; DDI record; notify (job id); fetch job submit JFDL/JSDL description.xml; Further infrastructure]

32 Fusion job flow description
- We use a Job Flow Description Language (JFDL) to submit the job to the computing resources pool
- The JFDL job description includes references to:
  - Input data sets
  - Processing steps and their relationships
  - Outputs
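The three kinds of reference in a job description (inputs, related processing steps, outputs) can be sketched as a plain Python data structure. This is not JFDL syntax, which is XML and is not reproduced in the transcript. The step names and file names are hypothetical. The sketch only shows how step relationships determine a valid execution order, using the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical job description: inputs, steps with their dependencies, outputs.
job = {
    "inputs": ["recipient.dta", "donor.dta"],
    "steps": {
        "extract_metadata": {"needs": []},
        "fuse": {"needs": ["extract_metadata"]},
        "write_ddi_record": {"needs": ["fuse"]},
    },
    "outputs": ["fused.dta", "fused_ddi.xml"],
}


def execution_order(job):
    """Return the step names in an order that respects every `needs` relationship."""
    graph = {name: set(step["needs"]) for name, step in job["steps"].items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(job)
```

A job manager would dispatch the steps in this order, or run independent steps in parallel as sub-jobs once their prerequisites have finished.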

33 JSDL/JFDL
[XML extract; markup lost in extraction. Surviving text: DAMES::Fusion …]
A brief extract!

34 Technology – other components
- Liferay portal
- eXist
  - XML based database – ideal for storing DDI metadata
- Condor
  - Job management
- iRODS
  - Highly flexible filestore
  - Capable of running automated processes on file upload, for example metadata extraction (e.g. from Stata files), JFDL → DDI translation, and transfer from file store to metadata store

35 Thank you!

